LNAI 2111



David Helmbold

Bob Williamson (Eds.) 



Computational 
Learning Theory 

14th Annual Conference 

on Computational Learning Theory, COLT 2001 

and 5th European Conference 

on Computational Learning Theory, EuroCOLT 2001 

Amsterdam, The Netherlands, July 2001 

Proceedings 



Springer




Lecture Notes in Artificial Intelligence 2111 

Subseries of Lecture Notes in Computer Science 
Edited by J. G. Carbonell and J. Siekmann 

Lecture Notes in Computer Science 

Edited by G. Goos, J. Hartmanis and J. van Leeuwen 




Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Singapore 

Tokyo 




David Helmbold Bob Williamson (Eds.) 



Computational 
Learning Theory 



14th Annual Conference 

on Computational Learning Theory, COLT 2001 

and 5th European Conference 

on Computational Learning Theory, EuroCOLT 2001 

Amsterdam, The Netherlands, July 16-19, 2001 

Proceedings 




Springer 




Series Editors 



Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA 
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors 
David Helmbold 

University of California, Santa Cruz 

School of Engineering, Department of Computer Science 

Santa Cruz, CA 95064, USA 

E-mail: dph@cse.ucsc.edu 

Bob Williamson 

Australian National University 

Research School of Information Sciences and Engineering 
Department of Telecommunications Engineering 
Canberra 0200, Australia 
E-mail: Bob.Williamson@anu.edu.au 



Cataloging-in-Publication Data applied for 
Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Computational learning theory : proceedings / 14th Annual Conference on 
Computational Learning Theory, COLT 2001 and 5th European Conference on 
Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, 
July 16 - 19, 2001. David Helmbold ; Bob Williamson (ed.). - Berlin ; 

Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; 
Singapore ; Tokyo : Springer, 2001 

(Lecture notes in computer science ; Vol. 2111: Lecture notes in 
artificial intelligence) 

ISBN 3-540-42343-5 



CR Subject Classification (1998): I.2.6, I.2.3, F.4.1, F.1.1, F.2
ISBN 3-540-42343-5 Springer-Verlag Berlin Heidelberg New York



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law. 

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH 

http://www.springer.de 

© Springer-Verlag Berlin Heidelberg 2001 
Printed in Germany 

Typesetting: Camera-ready by author 

Printed on acid-free paper SPIN: 10839809 06/3142 5 4 3 2 1 0 




Preface 



This volume contains papers presented at the joint 14th Annual Conference on 
Computational Learning Theory and 5th European Conference on Computatio- 
nal Learning Theory, held at the Trippenhuis in Amsterdam, The Netherlands 
from July 16 to 19, 2001. 

The technical program contained 40 papers selected from 69 submissions. In
addition, David Stork (Ricoh California Research Center) was invited to give a
lecture and to make a written contribution to the proceedings.

The Mark Fulk Award is presented annually for the best paper co-authored 
by a student. This year’s award was won by Olivier Bousquet for the paper 
“Tracking a Small Set of Modes by Mixing Past Posteriors” (co-authored with 
Manfred K. Warmuth). 

We gratefully thank all of the individuals and organizations responsible for
the success of the conference. We are especially grateful to the program
committee: Dana Angluin (Yale), Peter Auer (Univ. of Technology, Graz), Nello
Cristianini (Royal Holloway), Claudio Gentile (Università di Milano), Lisa
Hellerstein (Polytechnic Univ.), Jyrki Kivinen (Univ. of Helsinki), Phil Long
(National Univ. of Singapore), Manfred Opper (Aston Univ.), John Shawe-Taylor
(Royal Holloway), Yoram Singer (Hebrew Univ.), Bob Sloan (Univ. of Illinois
at Chicago), Carl Smith (Univ. of Maryland), Alex Smola (Australian National
Univ.), and Frank Stephan (Univ. of Heidelberg), for their efforts in reviewing
and selecting the papers in this volume.

Special thanks go to our conference co-chairs, Peter Grünwald and Paul
Vitányi, as well as Marja Hegt. Together they handled the conference
publicity and all the local arrangements to ensure a successful conference. We
would also like to thank ACM SIGACT for the software used in the program
committee deliberations and Stephen Kwek for maintaining the COLT web site.

Finally, we would like to thank The National Research Institute for Ma- 
thematics and Computer Science in the Netherlands (CWI), The Amsterdam 
Historical Museum, and The Netherlands Organization for Scientific Research 
(NWO) for their sponsorship of the conference. 



May 2001 David Helmbold 

Bob Williamson 
Program Co-chairs
COLT/EuroCOLT 2001 




Table of Contents 



How Many Queries Are Needed to Learn One Bit of Information? 1 

Hans Ulrich Simon 

Radial Basis Function Neural Networks Have Superlinear VC Dimension . . 14 
Michael Schmitt 

Tracking a Small Set of Experts by Mixing Past Posteriors 31 

Olivier Bousquet and Manfred K. Warmuth 

Potential-Based Algorithms in Online Prediction and Game Theory 48 

Nicolò Cesa-Bianchi and Gábor Lugosi

A Sequential Approximation Bound for Some Sample-Dependent Convex 

Optimization Problems with Applications in Learning 65 

Tong Zhang 

Efficiently Approximating Weighted Sums with Exponentially Many Terms 82 
Deepak Chawla, Lin Li, and Stephen Scott 

Ultraconservative Online Algorithms for Multiclass Problems 99 

Koby Crammer and Yoram Singer 

Estimating a Boolean Perceptron from Its Average Satisfying Assignment: 

A Bound on the Precision Required 116 

Paul W. Goldberg

Adaptive Strategies and Regret Minimization in Arbitrarily Varying Markov 

Environments 128 

Shie Mannor and Nahum Shimkin 

Robust Learning — Rich and Poor 143 

John Case, Sanjay Jain, Frank Stephan, and Rolf Wiehagen 

On the Synthesis of Strategies Identifying Recursive Functions 160 

Sandra Zilles 

Intrinsic Complexity of Learning Geometrical Concepts from Positive Data 177 
Sanjay Jain and Efim Kinber 

Toward a Computational Theory of Data Acquisition and Truthing 194

David G. Stork 

Discrete Prediction Games with Arbitrary Feedback and Loss 208 

Antonio Piccolboni and Christian Schindelhauer 

Rademacher and Gaussian Complexities: 

Risk Bounds and Structural Results 224 

Peter L. Bartlett and Shahar Mendelson 

Further Explanation of the Effectiveness of Voting Methods: 

The Game between Margins and Weights 241 

Vladimir Koltchinskii, Dmitriy Panchenko, and Fernando Lozano 







Geometric Methods in the Analysis of Glivenko-Cantelli Classes 256 

Shahar Mendelson 

Learning Relatively Small Classes 273 

Shahar Mendelson 

On Agnostic Learning with {0, *, 1}-Valued and Real-Valued Hypotheses . 289
Philip M. Long 

When Can Two Unsupervised Learners Achieve PAC Separation? 303 

Paul W. Goldberg

Strong Entropy Concentration, Game Theory, 

and Algorithmic Randomness 320 

Peter Grünwald

Pattern Recognition and Density Estimation under the General i.i.d. 

Assumption 337 

Ilia Nouretdinov, Volodya Vovk, Michael Vyugin, and Alex Gammerman 

A General Dimension for Exact Learning 354 

José L. Balcázar, Jorge Castro, and David Guijarro
Data-Dependent Margin-Based Generalization Bounds for Classification . . 368 
Balázs Kégl, Tamás Linder, and Gábor Lugosi

Limitations of Learning via Embeddings in Euclidean Half-Spaces 385 

Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon 
Estimating the Optimal Margins of Embeddings in Euclidean Half Spaces 402 
Jürgen Forster, Niels Schmitt, and Hans Ulrich Simon

A Generalized Representer Theorem 416 

Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola
A Leave-One-out Cross Validation Bound for Kernel Methods 

with Applications in Learning 427 

Tong Zhang 

Learning Additive Models Online with Fast Evaluating Kernels 444 

Mark Herbster 

Geometric Bounds for Generalization in Boosting 461 

Shie Mannor and Ron Meir 

Smooth Boosting and Learning with Malicious Noise 473 

Rocco A. Servedio 

On Boosting with Optimal Poly-Bounded Distributions 490 

Nader H. Bshouty and Dmitry Gavinsky 

Agnostic Boosting 507 

Shai Ben-David, Philip M. Long, and Yishay Mansour 

A Theoretical Analysis of Query Selection for Collaborative Filtering 517 

Wee Sun Lee and Philip M. Long 

On Using Extended Statistical Queries to Avoid Membership Queries .... 529 
Nader H. Bshouty and Vitaly Feldman 







Learning Monotone DNF from a Teacher That Almost Does Not Answer 



Membership Queries 546 

Nader H. Bshouty and Nadav Eiron 

On Learning Monotone DNF under Product Distributions 558 

Rocco A. Servedio 

Learning Regular Sets with an Incomplete Membership Oracle 574 

Nader Bshouty and Avi Owshanko 

Learning Rates for Q-Learning 589 

Eyal Even-Dar and Yishay Mansour 

Optimizing Average Reward Using Discounted Rewards 605 

Sham Kakade 

Bounds on Sample Size for Policy Evaluation in Markov Environments . . . 616 
Leonid Peshkin and Sayan Mukherjee 

Author Index 631 




How Many Queries Are Needed 
to Learn One Bit of Information?* 



Hans Ulrich Simon

Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany

simon@lmi.ruhr-uni-bochum.de



Abstract. In this paper we study the question how many queries are 
needed to “halve a given version space”. In other words: how many 
queries are needed to extract from the learning environment the one 
bit of information that rules out fifty percent of the concepts which are 
still candidates for the unknown target concept. We relate this problem 
to the classical exact learning problem. For instance, we show that lower 
bounds on the number of queries needed to halve a version space also 
apply to randomized learners (whereas the classical adversary arguments 
do not readily apply). Furthermore, we introduce two new combinato- 
rial parameters, the halving dimension and the strong halving dimen- 
sion, which determine the halving complexity (modulo a small constant 
factor) for two popular models of query learning: learning by a mini- 
mum adequate teacher (equivalence queries combined with membership 
queries) and learning by counterexamples (equivalence queries alone). 
These parameters are finally used to characterize the additional power 
provided by membership queries (compared to the power of equivalence 
queries alone). All investigations are purely information-theoretic and 
ignore computational issues. 



1 Introduction 

The exact learning model was introduced by Angluin in [1]. In this model, a
learner A tries to identify an unknown target concept C* (of the form
$C^* : X \to \{0, 1\}$ for a finite set X) by means of queries that must be
honestly answered by an oracle. Although the oracle must not lie, it may select
its answers in a worst-case fashion so as to slow down the learning process as
much as possible. In the (worst-case) analysis of A, we assume that the oracle
is indeed an adversary of A that makes full use of this freedom. (In the sequel,
we sometimes say “adversary” instead of “oracle” for this reason.) Furthermore,
A must be able to identify any target concept selected from a (known) concept
class $\mathcal{C}$. Again, A is subjected to a worst-case analysis, i.e., we
count the number of queries needed to identify the hardest concept from
$\mathcal{C}$ (that is, the concept that forces A to invest a maximal number of
queries).

Among the most popular query types are the following ones: 

* This work has been supported in part by the ESPRIT Working Group in Neural and 
Computational Learning II, NeuroCOLT2, No. 27150. The author was also supported 
by the Deutsche Forschungsgemeinschaft Grant SI 498/4-1. 



D. Helmbold and B. Williamson (Eds.): COLT/EuroCOLT 2001, LNAI 2111, pp. 1-13, 2001.
© Springer-Verlag Berlin Heidelberg 2001






Equivalence Queries. A selects a hypothesis H from its hypothesis class
$\mathcal{H}$. (Typically, $\mathcal{H} = \mathcal{C}$ or $\mathcal{H}$ is a
superset of $\mathcal{C}$.) If H = C*, the oracle answers “YES” (signifying
that A succeeded in identifying the target concept). Otherwise, the oracle
returns a counterexample, i.e., an $x \in X$ such that $H(x) \neq C^*(x)$.

Membership Queries. A selects an $x \in X$ and receives the label $C^*(x)$ from
the oracle.

At each time of the learning process, the so-called version space V contains all
concepts from $\mathcal{C}$ that do not contradict the answers that have been
received so far. Clearly, the learner has succeeded in identifying C* as soon as
V = {C*}. It is well known in the learning community that the task of
identifying an unknown but fixed target concept from $\mathcal{C}$ is equivalent
to the task of playing against another adversary who need not commit itself to a
target concept in the beginning. The answers of this adversary are considered
honest as long as they do not lead to an empty version space. The learner still
tries to shrink the version space to a singleton while issuing as few queries as
possible. We will refer to this task as the “contraction task” (or the
“contraction game”). At first glance, the contraction task seems to give more
power to the adversary. However, if we assume that A is deterministic, both
tasks require the same number of queries: since A is deterministic, one can
“predict” which concept C* will form the final (singleton) version space in the
hardest scenario of the contraction task. Now, it does not hurt the adversary to
commit itself to C* as the target concept in the beginning.

Since randomized learning complexity will be an issue in this paper, we briefly 
illustrate that the initial commitment to a target concept is relevant when we 
allow randomized learners: 

Example 1. Consider the model of learning by means of equivalence queries. Let
the concept and hypothesis class coincide with the set of all functions from X
to {0, 1}, where X = {1, . . . , d}. Clearly, the contraction task forces each
deterministic (or randomized) algorithm to issue d equivalence queries because
each (non-redundant) query halves the version space. As for deterministic
algorithms, the same remark is valid for the learning task. However, the
following randomized learning algorithm needs only d/2 queries on average:

Pick a first hypothesis $H_0 : X \to \{0, 1\}$ uniformly at random. Given that
the current hypothesis is H and that counterexample $x \in X$ is received, let
the next hypothesis H' coincide with H on $X \setminus \{x\}$ and set
$H'(x) = 1 - H(x)$.

The number of queries needed to identify an arbitrary (but fixed) target concept
C* equals the number of instances on which C* and $H_0$ disagree. This is d/2 on
average.
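The averaging argument of Example 1 can be checked by brute force for small d. The following is only an illustrative sketch: the function names, the 0-based instance set {0, ..., d-1}, and the choice of the first disagreeing instance as counterexample are assumptions made here, not part of the paper. As in the example, the final query that is answered “YES” is not counted.

```python
import itertools


def count_queries(target, first_hypothesis):
    """Run the learner of Example 1; return the number of counterexamples received.

    Concepts are tuples of 0/1 labels over X = {0, ..., d-1} (an assumed encoding).
    """
    hypothesis = list(first_hypothesis)
    counterexamples = 0
    while tuple(hypothesis) != tuple(target):
        # The oracle may return any instance on which hypothesis and target differ;
        # this sketch simply takes the first one.
        x = next(i for i, (h, c) in enumerate(zip(hypothesis, target)) if h != c)
        hypothesis[x] = 1 - hypothesis[x]  # flip the label on the counterexample
        counterexamples += 1
    return counterexamples


def expected_queries(target):
    """Average the query count over all 2^d equally likely first hypotheses."""
    d = len(target)
    starts = list(itertools.product((0, 1), repeat=d))
    return sum(count_queries(target, h0) for h0 in starts) / len(starts)


if __name__ == "__main__":
    # For d = 4 the average is d/2 = 2, whatever the fixed target is.
    print(expected_queries((1, 0, 1, 1)))
```

Each counterexample repairs exactly one disagreement, so the count equals the Hamming distance between the target and the first hypothesis, whichever counterexample the oracle picks.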

This example demonstrates that the typical adversary arguments that are used
in the literature for proving lower bounds on the number of queries do not
readily apply to randomized learners.¹

The main issue in this paper is the number of queries needed to halve (as
opposed to contract) an initial version space $V_0$. There are several reasons
for this kind of research:

— Contraction of the version space by iterated halving is considered very
efficient. Iterated halving is an important building block of well-known
strategies such as the “Majority Vote Strategy”, for instance. The binary search
paradigm is based on halving. Halving may therefore be considered an
interesting problem in its own right.

— Halving the version space yields exactly one bit of information. In this sense,
we explore the hardness of extracting one bit of information from the learning
environment. This sounds like an elementary and natural problem.

— Although the contraction task is not meaningful for randomized learners,
we will be able to show that the halving task is meaningful. This makes
adversary arguments applicable to randomized learning algorithms.

— We can characterize the halving complexity for two popular query types
(equivalence and membership queries) by tight combinatorial bounds (leaving
almost no gap).² These bounds can be used to characterize the additional
power provided by membership queries (compared to the power of equivalence
queries alone).

The paper is organized as follows. In Section 2 we present the basic
definitions and notations. In Section 3 we view the tasks of learning,
contraction and halving as a game between the learner (contraction algorithm,
halving algorithm, respectively) and an adversary. In Section 4 we investigate
the relation between halving and learning complexity (including randomized
learning complexity). In Section 5 we present the combinatorial (lower and
upper) bounds on the halving complexity. In Section 6 these bounds are used to
characterize the additional power provided by membership queries (compared to
the power of equivalence queries alone).

1 To the best of our knowledge, almost all papers devoted to query learning assume
deterministic learning algorithms. A notable exception is the paper of Maass that
demonstrates the significance of supporting examples in the on-line learning model,
when the learner is randomized and the learning environment is oblivious.

2 The derivation of these bounds is based on ideas and results from [?], where bounds
on the number of queries needed for learning (i.e., for contracting the version space)
are presented. These bounds, however, leave a gap. It seems that the most accurate
combinatorial bounds are found on the level of the halving task. (See also [?] for a
survey of papers presenting upper and lower bounds on query complexity.)



2 Basic Definitions and Notations

Let X be a finite set and $\mathcal{C}, \mathcal{H}$ be two families of
functions from X to {0, 1}. In the sequel, we refer to X as the instance space,
to $\mathcal{C}$ as the concept class, and to $\mathcal{H}$ as the hypothesis
class. It is assumed that $\mathcal{C} \subseteq \mathcal{H}$. A labeled
instance $(x, b) \in X \times \{0, 1\}$ is called a sample-point. A sample is a
collection of sample-points. For convenience, we represent each sample S as a
partially defined binary function over domain X. More formally, S is of the
form $S : X \to \{0, 1, ?\}$, where S(x) = ? indicates that S is undefined on
instance x. The set

$$\operatorname{supp}(S) = \{x \in X : S(x) \neq\ ?\} \tag{1}$$

is called the support of S. Note that a concept or hypothesis can be viewed as a
sample with full support. The size of S is the number of instances in its
support. S' is called a subsample of S, denoted as $S' \subseteq S$, if
$\operatorname{supp}(S') \subseteq \operatorname{supp}(S)$ and S'(x) = S(x) for
each instance $x \in \operatorname{supp}(S')$. We say that sample S and concept
C are consistent if $S \subseteq C$. We say that S has a consistent explanation
in $\mathcal{C}$ if there exists a concept $C \in \mathcal{C}$ such that S and C
are consistent. The terminology for hypotheses is analogous.
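The definitions above translate directly into code. The sketch below assumes one possible encoding that is not prescribed by the paper: a sample is a Python dict from instances to labels, and an instance that is missing from the dict plays the role of the value “?”. All function names are hypothetical.

```python
def support(sample):
    """supp(S): the set of instances on which the sample is defined."""
    return set(sample)


def size(sample):
    """The size of S is the number of instances in its support."""
    return len(support(sample))


def is_subsample(sub, sample):
    """S' is a subsample of S: supp(S') lies in supp(S) and the labels agree there."""
    return all(x in sample and sample[x] == b for x, b in sub.items())


def is_consistent(sample, concept):
    """A sample and a concept are consistent if the sample is a subsample of it."""
    return is_subsample(sample, concept)


def has_consistent_explanation(sample, concept_class):
    """True if some concept in the class is consistent with the sample."""
    return any(is_consistent(sample, concept) for concept in concept_class)


if __name__ == "__main__":
    # Two concepts over X = {1, 2, 3}, encoded as samples with full support.
    concepts = [{1: 0, 2: 0, 3: 1}, {1: 1, 2: 0, 3: 1}]
    print(has_consistent_explanation({1: 1, 3: 1}, concepts))
    print(has_consistent_explanation({2: 1}, concepts))
```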

In the exact learning model, a learner (learning algorithm) A has to identify
an unknown target concept $C^* \in \mathcal{C}$ by means of queries. The query
learning process can be informally described as follows. Each query must be
honestly answered by an oracle. Learning proceeds in rounds. In each round, A
issues the next query and obtains an honest answer from the oracle. The current
version space V is the set of concepts from $\mathcal{C}$ that do not contradict
the answers received so far. Initially, $V = \mathcal{C}$. From round to round,
V shrinks. However, at least the target concept C* always belongs to V because
the answers given by the oracle are honest. The learning process stops when
V = {C*}.

For the sake of simplicity, we formalize this general framework only for the 
following popular models of exact learning: 

Equivalence Query Learning (EQ-Learning). Each allowed query can be
identified with a hypothesis $H \in \mathcal{H}$. If H = C*, the only honest
answer is “YES” (signifying that the target concept has been exactly identified
by A). Otherwise, an honest answer is a counterexample to H, i.e., an instance x
such that $H(x) \neq C^*(x)$.

Membership Query Learning (MQ-Learning). Each allowed query can be
identified with an instance $x \in X$. The only honest answer is $C^*(x)$.

EQ-MQ-Learning. The learner may issue both types of queries.

Let V be the current version space. If A issues a membership query with instance
x and receives the binary label b, then the subsequent version space is given by

$$V[x, b] = \{C \in V : C(x) = b\}. \tag{2}$$

Similarly, if A issues an equivalence query with hypothesis H and receives the
counterexample x, then the subsequent version space is given by

$$V[H, x] = \{C \in V : C(x) \neq H(x)\}. \tag{3}$$

Clearly, answer “YES” to an equivalence query immediately leads to the final 
version space {C*}. 
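The two update rules (2) and (3) are plain filters on the current version space. A minimal sketch, assuming that concepts and hypotheses are encoded as tuples of labels indexed by the instances 0, ..., d-1 (an encoding chosen here for illustration; the function names are hypothetical):

```python
def after_membership_query(version_space, x, b):
    """V[x, b]: keep the concepts that assign label b to instance x."""
    return [c for c in version_space if c[x] == b]


def after_counterexample(version_space, hypothesis, x):
    """V[H, x]: keep the concepts that disagree with hypothesis H on instance x."""
    return [c for c in version_space if c[x] != hypothesis[x]]


if __name__ == "__main__":
    # All four functions from {0, 1} to {0, 1}.
    V = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(after_membership_query(V, 0, 1))
    print(after_counterexample(V, (0, 0), 1))
```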

In general, V[Q, R] denotes the version space resulting from the current
version space V when A issues query Q and receives answer R. We denote by
$\mathcal{Q}$ the set of queries from which Q must be selected. Given
$\mathcal{C}$, $\mathcal{H}$, a collection $\mathcal{Q}$ of allowed queries, and
a deterministic learner A, we define
$\mathrm{DLC}_A^{\mathcal{Q}}(\mathcal{C}, \mathcal{H})$ as the following unique
number q:

— There exists a target concept $C^* \in \mathcal{C}$ and a sequence of honest
answers to the queries selected by A such that the learning process does not
stop before round q.

— For each target concept $C^* \in \mathcal{C}$ and each sequence of honest
answers to the queries selected by A, the learning process stops after round q
or earlier.

In other words, $\mathrm{DLC}_A^{\mathcal{Q}}(\mathcal{C}, \mathcal{H})$ is the
smallest number q of queries such that A is guaranteed to identify any target
concept from $\mathcal{C}$ with hypotheses from $\mathcal{H}$ using q queries
from $\mathcal{Q}$. The deterministic learning complexity associated with
$\mathcal{C}, \mathcal{H}, \mathcal{Q}$ is given by

$$\mathrm{DLC}^{\mathcal{Q}}(\mathcal{C}, \mathcal{H}) = \min_A \mathrm{DLC}_A^{\mathcal{Q}}(\mathcal{C}, \mathcal{H}), \tag{4}$$

where A varies over all deterministic learners.



3 Games Related to Learning 

Since we measure the number of queries needed by the learner in a worst-case
fashion, we can model the learning process as a game between two players:
the learner A and its adversary ADV. We use the notation $\mathrm{ADV}_A$ to
indicate the strongest possible adversary of A. We begin with a rather
straightforward interpretation of exact learning as a game.



3.1 The Learning Game 

$\mathcal{C}$, $\mathcal{H}$ and $\mathcal{Q}$ are fixed and known to both
players. The game proceeds as follows. In a first move (invisible to A), ADV
picks the target concept C* from $\mathcal{C}$. Afterwards, both players proceed
in rounds. In each round, first player A makes its move by selecting a query
from $\mathcal{Q}$. Then, ADV makes its move by selecting an honest answer. The
game is over when the current version space does not contain any concept
different from C*. The goal of A is to finish the game as soon as possible,
whereas the goal of ADV is to continue playing as long as possible. A is
evaluated against the strongest adversary $\mathrm{ADV}_A$ that forces A to make
a maximum number of moves (or the maximum expected number of moves in the case
of a randomized learner).

It should be evident that the number of rounds in the learning game between
a deterministic learner A and $\mathrm{ADV}_A$ coincides with the quantity
$\mathrm{DLC}_A^{\mathcal{Q}}(\mathcal{C}, \mathcal{H})$ that was defined in the
previous section. Thus, $\mathrm{DLC}^{\mathcal{Q}}(\mathcal{C}, \mathcal{H})$
coincides with the number of rounds in the learning game between the best
deterministic learner and its adversary.

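We define $\mathrm{RLC}_A^{\mathcal{Q}}(\mathcal{C}, \mathcal{H})$ as the
expected number of rounds in the learning game between the (potentially)
randomized learner A and its strongest adversary $\mathrm{ADV}_A$.³ The
randomized learning complexity associated with
$\mathcal{C}, \mathcal{H}, \mathcal{Q}$ is given by

$$\mathrm{RLC}^{\mathcal{Q}}(\mathcal{C}, \mathcal{H}) = \min_A \mathrm{RLC}_A^{\mathcal{Q}}(\mathcal{C}, \mathcal{H}), \tag{5}$$

where A varies over all (potentially) randomized learners.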

3.2 The Contraction Game 

It is well known that, in the case of deterministic learners A , the learning game 
can be replaced by a conceptually simpler game, differing from the learning game 
as follows: 

— The first move of ADV is omitted, i.e., ADV makes no commitment about 
the target concept in the beginning. 

— Each (syntactically correct)⁴ answer that does not lead to an empty version
space is honest.

— The game is over when the version space is a singleton. 

Again, the goal of player A is to finish the game as soon as possible, whereas the
goal of the adversary is to finish as late as possible. A is evaluated against its
strongest adversary $\mathrm{ADV}_A$. We will refer to this new game as the
contraction game and to A as a contraction algorithm.

The following lemmas recall some well known facts (in a slightly more general 
setting) . 

Lemma 1. As for the contraction game, there exist two deterministic optimal
players A* and ADV*, i.e., the following holds:

1. Let A be any (potentially randomized) contraction algorithm. Then, ADV*
forces A to make at least as many moves as A*.

2. Let ADV be any (potentially randomized) oracle. Then A* needs no more
moves against ADV than against ADV*.

The proof uses a standard argument which is given here for the sake of
completeness.

Proof. Consider the decision tree T that models the moves of both players.
Each node of T is of type either $\mathcal{Q}$ or $\mathcal{R}$ (signifying
which player makes the next move). Each node of type $\mathcal{Q}$ is marked by
a version space (reflecting the actual configuration of the contraction game),
and each node of type $\mathcal{R}$ is marked by a version space and a question
(again reflecting the actual configuration of the game including the last
question of A). The structure of T can be inductively described as follows:

3 Here, $\mathrm{ADV}_A$ knows the program of A, but A determines its next move by
means of secret random bits. Thus, $\mathrm{ADV}_A$ knows the probability
distribution of the future moves, but cannot exactly predict them. This
corresponds to what is called “weakly oblivious environment” in [5].

4 E.g., the answer to an equivalence (or membership) query must be an instance from 
X (or a binary label, respectively). 






— Its root is of type $\mathcal{Q}$ and marked $\mathcal{C}$ (the initial
version space in the contraction game).

— A node of type $\mathcal{Q}$ that is marked by a singleton (version space of
size 1) is a leaf (signifying that the game is over).

— Each inner node v of type $\mathcal{Q}$ that is marked V has k children
$v[Q_1], \ldots, v[Q_k]$, where $Q_1, \ldots, Q_k$ denote the non-redundant
questions that A is allowed to issue at this stage. Node $v[Q_i]$ is of type
$\mathcal{R}$ and marked $(V, Q_i)$.

— Each inner node w of type $\mathcal{R}$ that is marked (V, Q) has l children
$w[R_1], \ldots, w[R_l]$, where $R_1, \ldots, R_l$ denote the honest answers of
ADV to question Q at this stage. Node $w[R_j]$ is of type $\mathcal{Q}$ and
marked $V[Q, R_j]$ (the version space resulting from V when A issues query Q and
receives answer $R_j$).

It is easy to describe deterministic optimal strategies for both players in a 
bottom-up fashion. At each node of T, the optimal decisions for A and ADV 
result from the following rules: 

— Each leaf is labeled 0 (signifying that no more moves of A are needed to
finish the game).

— If a node w of type $\mathcal{R}$ that is marked (V, Q) has children
$w[R_1], \ldots, w[R_l]$ labeled $(n_1, \ldots, n_l)$, respectively, and
$n_j = \max\{n_1, \ldots, n_l\}$, then w is labeled $n_j$. Furthermore, ADV
should answer $R_j$ to question Q, given that V is the current version space.

— If a node v of type $\mathcal{Q}$ that is marked V has children
$v[Q_1], \ldots, v[Q_k]$ labeled $(m_1, \ldots, m_k)$, respectively, and
$m_i = \min\{m_1, \ldots, m_k\}$, then v is labeled $1 + m_i$. Furthermore, A
should ask question $Q_i$, given that V is the current version space.

Note that these rules can be made deterministic by resolving the ties in an
arbitrary deterministic fashion. It is easy to prove for each node v of T (by
induction on the height of v) that the following holds:

If v is marked V, then the rules specify two deterministic optimal players (in
the sense of Lemma 1) for the partial contraction game that starts with initial
version space V. The bottom-up label associated with v specifies the number of
rounds in this partial game when both players follow the rules.

The extrapolation of this claim to the root node of T yields Lemma 1.

Lemma 1 implies that A* is the best contraction algorithm among all (possibly
randomized) algorithms. (Remember that each algorithm A is evaluated against
its strongest adversary $\mathrm{ADV}_A$.) It also implies that ADV* is the
strongest adversary of A*.
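The bottom-up labeling in the proof of Lemma 1 amounts to a min-max recursion over version spaces, which can be run by brute force on tiny classes. The sketch below is an illustration under assumptions made here, not the paper's construction verbatim: concepts and hypotheses are tuples of labels over instances 0, ..., n-1, the function name and flags are hypothetical, and a question is treated as redundant when some honest answer leaves the version space unchanged.

```python
import itertools
import math
from functools import lru_cache


def contraction_value(concepts, hypotheses, n_instances, use_eq=True, use_mq=True):
    """Number of rounds of the contraction game when both players play optimally."""
    queries = []
    if use_mq:
        queries += [("MQ", x) for x in range(n_instances)]
    if use_eq:
        queries += [("EQ", h) for h in hypotheses]

    def outcomes(V, query):
        """Version spaces reachable by an honest answer to the query."""
        kind, arg = query
        if kind == "MQ":
            results = [frozenset(c for c in V if c[arg] == b) for b in (0, 1)]
        else:
            results = [frozenset(c for c in V if c[x] != arg[x])
                       for x in range(n_instances)]
            if arg in V:
                results.append(frozenset([arg]))  # the answer "YES"
        return [W for W in results if W]  # honest answers keep V non-empty

    @lru_cache(maxsize=None)
    def value(V):
        if len(V) == 1:
            return 0  # leaf: the game is over
        best = math.inf  # stays infinite if every question is redundant
        for query in queries:
            results = outcomes(V, query)
            if V in results:
                continue  # redundant: the adversary can answer without shrinking V
            # The adversary maximizes, the learner minimizes.
            best = min(best, 1 + max(value(W) for W in results))
        return best

    return value(frozenset(concepts))


if __name__ == "__main__":
    cube = list(itertools.product((0, 1), repeat=3))  # the class of Example 1, d = 3
    print(contraction_value(cube, cube, 3, use_mq=False))
```

For the class of Example 1 with d = 3 and equivalence queries only, the recursion returns 3, in line with the claim that d queries are needed in the contraction task.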

Lemma 2. $\mathrm{DLC}^{\mathcal{Q}}(\mathcal{C}, \mathcal{H})$ coincides with
the number of rounds, say q*, in the contraction game between A* and ADV*.

The proof of this lemma (given here for sake of completeness) is well known in 
the learning community and is, in fact, the justification of the popular adversary 
arguments within the derivation of lower bounds on the number of queries needed 
by deterministic learners. 






Proof. The contraction game coincides with the learning game, except for the
commitment that the adversary has to make in the first step of the learning
game: the selection of a target concept $C^* \in \mathcal{C}$. Thus,
$\mathrm{DLC}^{\mathcal{Q}}(\mathcal{C}, \mathcal{H}) \le q^*$. It suffices
therefore to show that for each deterministic learner A there exists an
adversary $\mathrm{ADV}_A$ that forces at least q* moves of A.

To this end, let A be an arbitrary, but fixed, deterministic learner. Let ADV*
be the optimal deterministic adversary in the contraction game that was
described in the proof of Lemma 1. Let A play against ADV* in the contraction
game.⁵ Let $q \ge q^*$ be the number of queries needed by A to finish the
contraction game against player ADV*, and let C* be the unique concept in the
final (singleton) version space. Now we may use an adversary $\mathrm{ADV}_A$ in
the learning game that selects C* as a target concept in the beginning and then
simulates ADV*. Since A is deterministic, this will lead to the same sequence of
moves as in the contraction game. Thus, $\mathrm{ADV}_A$ can force $q \ge q^*$
moves of A.

Note that a lower bound argument can deal with a sub-optimal (but possibly
easier to analyze) adversary ADV (instead of ADV*). Symmetrically, an upper
bound argument may use a sub-optimal (but possibly easier to analyze)
contraction algorithm A (instead of A*).

We briefly remind the reader of Example 1. If $\mathcal{C}$ contains all
functions from {1, . . . , d} to {0, 1}, then Example 1 shows that

$$\mathrm{DLC}^{\mathrm{EQ}}(\mathcal{C}, \mathcal{C}) = d \quad\text{and}\quad \mathrm{RLC}^{\mathrm{EQ}}(\mathcal{C}, \mathcal{C}) \le d/2.$$

In the light of Lemmas 1 and 2, this demonstrates that the contraction game
does not model the learning game when randomized learners are allowed.

3.3 The Halving Game 

The halving game is defined like the contraction game except that it may start with an arbitrary initial version space $V_0 \subseteq \mathcal{C}$ (known to both players), and it is over as soon as the current version space $V$ contains at most half of the concepts of $V_0$. Player A (called halving algorithm in this context) tries to halve $V_0$ as fast as possible. Player ADV is its strongest adversary.

As in the contraction game, there exist two optimal deterministic players: A* (representing the optimal halving algorithm) and ADV* (which is also the strongest adversary for A*). (Compare with the corresponding lemma for the contraction game.) Let $\mathrm{HC}^{Q}(V_0,\mathcal{H})$ be defined as the number of rounds in the halving game between A* and ADV*. In other words, $\mathrm{HC}^{Q}(V_0,\mathcal{H})$ is the smallest number of queries that suffices to halve the initial version space $V_0$ when all queries are answered in a worst-case fashion. This parameter has the disadvantage of not being monotonic: a subset of $V_0$ might be harder to halve than $V_0$ itself. In order to force monotonicity, we define the halving complexity associated with $\mathcal{C}, \mathcal{H}, Q$ as follows:

$$\mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H}) = \max\{\mathrm{HC}^{Q}(V,\mathcal{H}) : V \subseteq \mathcal{C}\} \qquad (6)$$
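The non-monotonicity and the maximum in (6) can be illustrated by brute force on a tiny class. The following is a minimal sketch, not the paper's method: it assumes Q = {MQ} only (so the hypothesis class plays no role), represents concepts as tuples of labels, and uses the hypothetical names `halving_rounds` and `halving_complexity`.

```python
from itertools import combinations, product


def halving_rounds(v0, n_instances):
    """Worst-case number of membership queries needed to halve v0.

    Concepts are tuples of 0/1 labels, one per instance. Assumes len(v0) >= 2
    and pairwise distinct concepts, so some query always splits the version space.
    """
    target = len(v0) / 2

    def rounds(v):
        if len(v) <= target:          # game over: at most half of v0 is left
            return 0
        best = None
        for x in range(n_instances):  # the learner picks the instance to query
            parts = [frozenset(c for c in v if c[x] == b) for b in (0, 1)]
            if any(len(p) == len(v) for p in parts):
                continue              # all concepts agree on x: useless query
            worst = 1 + max(rounds(p) for p in parts)  # adversary picks the answer
            best = worst if best is None else min(best, worst)
        return best

    return rounds(frozenset(v0))


def halving_complexity(concepts, n_instances):
    """Equation (6): maximum of halving_rounds over all subsets with >= 2 concepts."""
    concepts = list(concepts)
    return max(
        halving_rounds(v, n_instances)
        for size in range(2, len(concepts) + 1)
        for v in combinations(concepts, size)
    )


cube = list(product((0, 1), repeat=3))          # all functions {0,1,2} -> {0,1}
singletons = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # a subset of the cube

print(halving_rounds(cube, 3))        # the full class
print(halving_rounds(singletons, 3))  # a subset that is harder to halve
print(halving_complexity(cube, 3))    # the monotone quantity from (6)
```

In this toy case the full cube is halved by a single query, while the three singleton concepts need two, which is why the maximum over subsets is taken.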

5 This looks like a dirty trick because A is an algorithm that expects to play the learning game. We will however argue later that A cannot distinguish the communication with ADV* from the communication with an adversary ADV in the learning game.






The relation between halving and learning is the central issue of the next 
section. 



4 Halving and Learning Complexity 



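Theorem 1. The following chain of inequalities is valid:

$$\tfrac{1}{2}\,\mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H}) \le \mathrm{RLC}^{Q}(\mathcal{C},\mathcal{H}) \qquad (7)$$

$$\le \max\{\mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H}),\ \mathrm{RLC}^{Q}(\mathcal{C},\mathcal{H})\} \qquad (8)$$

$$\le \mathrm{DLC}^{Q}(\mathcal{C},\mathcal{H}) \qquad (9)$$

$$\le \lfloor \log|\mathcal{C}| \rfloor \cdot \mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H}). \qquad (10)$$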



Proof. We begin with the proof of inequality (7). Let A be an arbitrary (potentially randomized) learning algorithm. Let $V^* \subseteq \mathcal{C}$ be the version space such that

$$q^* = \mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H}) = \mathrm{HC}^{Q}(V^*,\mathcal{H}).$$

Let ADV* be the optimal deterministic adversary for the problem of halving $V^*$. Thus, ADV* forces each halving algorithm to issue at least $q^*$ queries. In order to derive a lower bound on $\mathrm{RLC}^{Q}(\mathcal{C},\mathcal{H})$, we use an adversary ADV that proceeds as follows:

First move. Pick a target concept $C^* \in V^*$ uniformly at random. Release the information that the target concept is taken from $V^*$ to A.

Subsequent moves. Simulate ADV* as long as its answers do not exclude target concept $C^*$ from the resulting version space. As soon as $C^*$ would be excluded, abort the simulation and give up.

We say that the learning algorithm A “wins” in q moves against ADV if it takes q moves until A either has identified the target concept $C^*$ or forced ADV to give up. It suffices to show that the expected number of moves needed for A to win against ADV is larger than $q^*/2$.

Note that the behaviour of ADV* does not depend on the first move of ADV. We may therefore alternatively run the complete simulation of ADV* in a first stage, pick the target concept $C^* \in V^*$ uniformly at random in a second stage, and undo the final illegal rounds with $C^*$ not belonging to the version space in a third stage. Let q be the number of rounds in stage 1, and let $V_i$ be the version space after i rounds. Since the halving game stops as soon as the initial version space $V^*$ is halved, $V_{q-1}$ still contains more than half of the concepts in $V^*$. Thus, with a probability exceeding 1/2, the target concept $C^*$ is taken from $V_{q-1}$. In this case, the learning algorithm A cannot win in fewer than q rounds. Since $q \ge q^*$, it follows that the expected number of rounds until the learner wins is larger than $q^*/2$.

We proceed with the proof of inequality (9). Deterministic query learning is correctly modelled by the contraction game. Since the contraction of an initial version space $V^*$ cannot be easier than halving $V^*$, we get $\mathrm{DLC}^{Q}(\mathcal{C},\mathcal{H}) \ge \mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H})$. Clearly, $\mathrm{DLC}^{Q}(\mathcal{C},\mathcal{H}) \ge \mathrm{RLC}^{Q}(\mathcal{C},\mathcal{H})$. Thus, inequality (9) is valid.






We move on to inequality (10). Contraction of the initial version space $\mathcal{C}$ is obtained after $\lfloor \log|\mathcal{C}| \rfloor$ halvings. A learning algorithm A may therefore proceed in $\lfloor \log|\mathcal{C}| \rfloor$ phases and apply, in each phase with current version space V, the optimal deterministic halving strategy w.r.t. initial version space V. In each phase, at most $\mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H})$ queries are issued.

Since inequality (8) is trivial, we are done.
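The phase-wise strategy behind inequality (10) can be sketched for one special case. The code below assumes Q = {EQ} and a hypothesis class containing all functions, so that the majority-vote hypothesis is admissible and every phase costs a single equivalence query; the names `majority_hypothesis` and `learn_by_halving`, and the rule for choosing the counterexample, are illustrative assumptions rather than part of the paper.

```python
import math
from itertools import product


def majority_hypothesis(version_space, n):
    """Label each instance with the strict-majority vote of the version space (ties -> 0)."""
    return tuple(
        int(2 * sum(c[x] for c in version_space) > len(version_space))
        for x in range(n)
    )


def learn_by_halving(concepts, target, n):
    """Return (learned concept, number of equivalence queries)."""
    v = list(concepts)
    queries = 0
    while len(v) > 1:
        h = majority_hypothesis(v, n)
        queries += 1                      # one equivalence query per phase
        if h == target:
            return h, queries
        # The teacher returns some counterexample; here: the first disagreement.
        x = next(i for i in range(n) if h[i] != target[i])
        v = [c for c in v if c[x] == target[x]]   # at most half of v survives
    return v[0], queries


cube = list(product((0, 1), repeat=3))
bound = math.floor(math.log2(len(cube)))  # floor(log |C|) phases, one query each
worst = max(learn_by_halving(cube, t, 3)[1] for t in cube)
print(worst, bound)
```

For general query sets and hypothesis classes a phase may need up to $\mathrm{HC}^{Q}_{*}(\mathcal{C},\mathcal{H})$ queries instead of one, which gives the product in (10).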

5 Bounds on the Halving Complexity 

For Q = {EQ, MQ} or Q = {EQ}, the halving complexity can be characterized by combinatorial parameters that we call “halving dimension” and “strong halving dimension”. These parameters are closely related to the “consistency dimension” and “strong consistency dimension” that were used in [2] for describing lower and upper bounds on the deterministic learning complexity. The bounds in [2] left a gap of size $\Theta(\log|\mathcal{C}|)$. The bounds that we derive in the course of this section are (almost) tight.

Definition 1. 1. The parameter $\mathrm{Hdim}(V,\mathcal{H})$ denotes the smallest number d with the following property. If a function $S : X \to \{0,1\}$ (sample with full support) does not belong to $\mathcal{H}$, then there exists a subsample $S' \subseteq S$ of size at most d such that the fraction of concepts in V that are consistent with $S'$ is at most 1/2. The halving dimension associated with $\mathcal{C}$ and $\mathcal{H}$ is given by

$$\mathrm{Hdim}_{*}(\mathcal{C},\mathcal{H}) = \max\{\mathrm{Hdim}(V,\mathcal{H}) : V \subseteq \mathcal{C}\}. \qquad (11)$$

2. The parameter $\mathrm{SHdim}(V,\mathcal{H})$ denotes the smallest number d with the following property. If a sample $S : X \to \{0,1,?\}$ has no consistent explanation in $\mathcal{H}$, then there exists a subsample $S' \subseteq S$ of size at most d such that the fraction of concepts in V that are consistent with $S'$ is at most 1/2. The strong halving dimension associated with $\mathcal{C}$ and $\mathcal{H}$ is given by

$$\mathrm{SHdim}_{*}(\mathcal{C},\mathcal{H}) = \max\{\mathrm{SHdim}(V,\mathcal{H}) : V \subseteq \mathcal{C}\}. \qquad (12)$$

Note that both definitions are almost identical, except for the subtle fact that 
the first definition ranges over samples with full support, whereas the second 
definition ranges over all samples. The next theorems show that the halving 
dimension characterizes the halving complexity when Q = {EQ,MQ}, and the 
strong halving dimension characterizes the halving complexity when Q = {EQ}. 
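Both definitions can be evaluated by exhaustive search on very small domains. The following is a minimal sketch under stated assumptions: instances are numbered 0, ..., n-1, concepts and hypotheses are tuples of labels, `None` stands for the label “?”, and the function names are hypothetical. It returns infinity when some unexplained sample cannot halve the version space at all.

```python
import math
from itertools import combinations, product


def consistent(concept, sample):
    """A concept (tuple of labels) agrees with every labeled instance of the sample."""
    return all(concept[x] == y for x, y in sample)


def smallest_halving_subsample(sample, v):
    """Size of the smallest subsample consistent with at most half of v."""
    for size in range(len(sample) + 1):
        for sub in combinations(sample, size):
            if 2 * sum(consistent(c, sub) for c in v) <= len(v):
                return size
    return math.inf  # no subsample halves v; the dimension is unbounded


def halving_dimension(v, h, n, strong=False):
    """Hdim(v, h) for strong=False, SHdim(v, h) for strong=True."""
    labels = (0, 1, None) if strong else (0, 1)   # None plays the role of "?"
    d = 0
    for assignment in product(labels, repeat=n):
        sample = [(x, y) for x, y in enumerate(assignment) if y is not None]
        if any(consistent(g, sample) for g in h):
            continue                              # explained by some hypothesis
        d = max(d, smallest_halving_subsample(sample, v))
    return d


singletons = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(halving_dimension(singletons, singletons, 3))
print(halving_dimension(singletons, singletons, 3, strong=True))
```

For the three singleton concepts over three instances (used as both V and the hypothesis class), the all-zero sample is the hardest one: two of its labeled points are needed before at most half of the concepts remain consistent.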

Theorem 2. $\mathrm{HC}^{EQ,MQ}(V,\mathcal{H}) = \mathrm{Hdim}(V,\mathcal{H})$.

Proof. For sake of simplicity, set $q = \mathrm{HC}^{EQ,MQ}(V,\mathcal{H})$ and $d = \mathrm{Hdim}(V,\mathcal{H})$.

First, we show that $q \ge d$. The minimality of d implies that there exists a function $C : X \to \{0,1\}$, $C \notin \mathcal{H}$, such that each subsample $S \subseteq C$ of size at most $d-1$ is consistent with more than half of the concepts in V. Thus, any halving algorithm issuing equivalence queries with hypotheses from $\mathcal{H}$ fails to be consistent with C and may obtain counterexamples taken from C. If, in addition, each membership query is answered in accordance with C, then the sample points returned after up to $d-1$ queries form a subsample $S' \subseteq C$ of size at most $d-1$. Hence, at least one additional query is needed for halving the initial version space V.

Note that the case Q = {EQ, MQ} and $\mathcal{H} = \emptyset$ also covers the case Q = {MQ}.

Second, we show that $q \le d$. Let $M : X \to \{0,1\}$ be a function that goes with the majority of the concepts in V on each instance $x \in X$ (breaking ties arbitrarily). If $M \in \mathcal{H}$, then $q = 1$ because the equivalence query with hypothesis M will halve the version space V. Clearly, $d \ge 1$. We may therefore assume w.l.o.g. that $M \notin \mathcal{H}$. It follows that there exists a subsample $S \subseteq M$ of size at most d that is inconsistent with at least half of the concepts in V. The crucial observation is that $q \le |S| \le d$ since V can be halved by issuing the $|S|$ membership queries for all instances in S: The adversary either fails to go with the majority of the concepts in V on some $x \in S$ (which immediately halves V) or goes with the majority on all $x \in S$. In the latter case, V is halved after all $|S|$ membership queries have been issued.

Corollary 1. $\mathrm{HC}^{EQ,MQ}_{*}(\mathcal{C},\mathcal{H}) = \mathrm{Hdim}_{*}(\mathcal{C},\mathcal{H})$.

Theorem 3. $\mathrm{SHdim}(V,\mathcal{H}) \le \mathrm{HC}^{EQ}(V,\mathcal{H}) \le \lceil \ln(4)\cdot\mathrm{SHdim}(V,\mathcal{H}) \rceil$.

Proof. For sake of simplicity, set $q = \mathrm{HC}^{EQ}(V,\mathcal{H})$ and $d = \mathrm{SHdim}(V,\mathcal{H})$.

First, we show that $q \ge d$. The minimality of d implies that there exists a sample $C : X \to \{0,1,?\}$ without consistent explanation in $\mathcal{H}$ such that each subsample $S \subseteq C$ of size at most $d-1$ is consistent with more than half of the concepts in V. Thus, any halving algorithm issuing equivalence queries with hypotheses from $\mathcal{H}$ fails to be consistent with C and may obtain counterexamples taken from C. After up to $d-1$ queries, these counterexamples form a subsample $S \subseteq C$ of size at most $d-1$. Hence, at least one additional query is needed for halving the initial version space V.

In order to prove $q \le \lceil \ln(4)\cdot d \rceil$, we describe an appropriate halving algorithm A. A keeps track of the current version space W (which is V initially). For i = 0, 1, let $M^i_W$ be the set

$$\left\{x \in X : \text{the fraction of concepts } C \in W \text{ with } C(x) = 1-i \text{ is less than } \tfrac{1}{2d}\right\}.$$

In other words, a very large fraction (at least $1 - 1/(2d)$) of the concepts in W votes for output label i on instances from $M^i_W$. Let $M_W$ be the sample assigning label $i \in \{0,1\}$ to all instances from $M^i_W$ and label “?” to all remaining instances (those without such a clear majority). Let $S \subseteq M_W$ be an arbitrary but fixed subsample of size at most d. The definition of $M_W$ implies (through some easy-to-check counting) that more than half of the concepts in W are consistent with S. The definition of the strong halving dimension implies that $M_W$ has a consistent explanation, say $H_W$, in $\mathcal{H}$.







The punchline of this discussion is: if A issues the equivalence query with hypothesis $H_W$ (for the current version space W), then the next counterexample will shrink W by a factor $1 - 1/(2d)$ (or by a smaller factor). For the purpose of halving the initial version space V, a sufficiently large number of equivalence queries is therefore obtained by solving

$$\left(1 - \frac{1}{2d}\right)^{q'} \le \frac{1}{2}$$

for $q'$. Clearly, $q' = \lceil \ln(4)\cdot d \rceil$ is sufficiently large.
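The sufficiency of $\lceil \ln(4)\cdot d \rceil$ queries can be checked in one line. The derivation below is a sketch that relies on the standard estimate $1 - t \le e^{-t}$, which the text does not state explicitly and which is assumed here:

```latex
\left(1 - \frac{1}{2d}\right)^{q'} \;\le\; e^{-q'/(2d)} \;\le\; \frac{1}{2}
\quad\Longleftarrow\quad
\frac{q'}{2d} \;\ge\; \ln 2
\quad\Longleftrightarrow\quad
q' \;\ge\; 2d\ln 2 \;=\; \ln(4)\cdot d .
```

Rounding $\ln(4)\cdot d$ up to the next integer therefore always yields an admissible number of queries.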

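Corollary 2. $\mathrm{SHdim}_{*}(\mathcal{C},\mathcal{H}) \le \mathrm{HC}^{EQ}_{*}(\mathcal{C},\mathcal{H}) \le \lceil \ln(4)\cdot\mathrm{SHdim}_{*}(\mathcal{C},\mathcal{H}) \rceil$.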

6 An Application of the Halving Dimension 

In this section, we show that the number of equivalence queries needed to halve a given version space V (roughly) equals the total number of equivalence and membership queries needed to halve V on the “hardest subdomain” K of X. Loosely speaking, there is always a subdomain K of X that leaves the problem of halving V by means of equivalence queries as hard as before, but which renders membership queries useless. A similar result was shown in [2] for contraction (deterministic exact learning). However, this result left a gap of size $\Theta(\log|\mathcal{C}|)$. The result proven here leaves only a (small) constant gap.

Let $K \subseteq X$. For each function $F : X \to \{0,1\}$, let $F_K$ denote the restriction of F to subdomain K. For each class $\mathcal{F}$ of functions from X to {0, 1}, we define $\mathcal{F}_K = \{F_K : F \in \mathcal{F}\}$. Then the following holds:

Lemma 3. $\mathrm{SHdim}(V,\mathcal{H}) = \max_{K \subseteq X} \mathrm{Hdim}(V_K,\mathcal{H}_K)$.

Proof. Set $d = \mathrm{SHdim}(V,\mathcal{H})$. Remember that d is the smallest number such that for all samples $S : X \to \{0,1,?\}$ the condition described in the second part of Definition 1 is satisfied. Let $d_K$ be the corresponding number when we restrict ourselves to samples S with support K. It is evident that $d = \max_{K \subseteq X} d_K$ and $d_K = \mathrm{Hdim}(V_K,\mathcal{H}_K)$, which completes the proof of the lemma.

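Combining Lemma 3, Corollary 1, and Corollary 2, we get

Corollary 3.

$$\max_{K \subseteq X} \mathrm{HC}^{EQ,MQ}(V_K,\mathcal{H}_K) \;\le\; \mathrm{HC}^{EQ}(V,\mathcal{H}) \;\le\; \left\lceil \ln(4)\cdot\max_{K \subseteq X} \mathrm{HC}^{EQ,MQ}(V_K,\mathcal{H}_K) \right\rceil.$$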

Among the obvious open problems are the following:

— The relation between halving and learning complexity that is proven in this 
paper leaves a gap. Can this gap be removed (at least for some concrete 
popular classes)? 

— Can the (strong) halving dimension be determined for some popular concept 
and hypothesis classes? 







References 

1. Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, 1988.

2. José L. Balcázar, Jorge Castro, David Guijarro, and Hans U. Simon. The consistency dimension and distribution-dependent learning from queries. In Proceedings of the 10th International Workshop on Algorithmic Learning Theory, pages 77-92. Springer Verlag, 1999.

3. Tibor Hegedüs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the 8th Annual Workshop on Computational Learning Theory, pages 108-117. ACM Press, 1995.

4. Lisa Hellerstein, Krishnan Pillaipakkamnatt, Vijay Raghavan, and Dawn Wilkins. How many queries are needed to learn? Journal of the Association for Computing Machinery, 43(5):840-862, 1996.

5. Wolfgang Maass. On-line learning with an oblivious environment and the power of randomization. In Proceedings of the 4th Annual Workshop on Computational Learning Theory, pages 167-175, 1991.




Radial Basis Function Neural Networks Have 
Superlinear VC Dimension* 



Michael Schmitt

Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik
Ruhr-Universität Bochum, D-44780 Bochum, Germany
http://www.ruhr-uni-bochum.de/lmi/mschmitt/
mschmitt@lmi.ruhr-uni-bochum.de



Abstract. We establish superlinear lower bounds on the Vapnik-Cher- 
vonenkis (VC) dimension of neural networks with one hidden layer and 
local receptive field neurons. As the main result we show that every rea- 
sonably sized standard network of radial basis function (RBF) neurons 
has VC dimension $\Omega(W \log k)$, where W is the number of parameters
and k the number of nodes. This significantly improves the previously 
known linear bound. We also derive superlinear lower bounds for net- 
works of discrete and continuous variants of center-surround neurons. 
The constants in all bounds are larger than those obtained thus far for 
sigmoidal neural networks with constant depth. 

The results have several implications with regard to the computational 
power and learning capabilities of neural networks with local receptive 
fields. In particular, they imply that the pseudo dimension and the fat- 
shattering dimension of these networks is superlinear as well, and they 
yield lower bounds even when the input dimension is fixed. The methods 
developed here appear suitable for obtaining similar results for other 
kernel-based function classes. 



1 Introduction 



Although there exists already a large collection of Vapnik-Chervonenkis (VC) 
dimension bounds for neural networks, it has not been known thus far whether 
the VC dimension of radial basis function (RBF) neural networks is superlinear. 
Major reasons for this might be that previous results establishing superlinear 
bounds are based on methods geared to sigmoidal neurons or consider networks 
having an unrestricted number of layers till Oil .'fl22| . RBF neural networks, how- 
ever, differ from other neural network types in two characteristic features (see, 
e.g., Bishop 0, Ripley HD) : There is only one hidden layer and the neurons 
have local receptive fields. In particular, the neurons are not of the sigmoidal 
type (see Koiran and Sontag m for a rather general definition of a sigmoidal 
activation function that does not capture radial basis functions). 



This work has been supported in part by the ESPRIT Working Group in Neural 
and Computational Learning II, NeuroCOLT2, No. 27150. 



D. Helmbold and B. Williamson (Eds.): COLT/EuroCOLT 2001, LNAI 2111, pp. 14-|S1J 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







Beside sigmoidal networks, RBF networks are among the major neural network types used in practice. They are appreciated because of their impressive capabilities in function approximation and learning that have been well studied in theory and practice (see, e.g., [5, 9, 15, 16, 18, 19, 20]). Sigmoidal neural networks are known to have VC dimension that is superlinear in the number of network parameters, even when there is only one hidden layer [22]. Since the VC dimension of single neurons is linear, this superlinearity witnesses the enormous computational capabilities that emerge when neurons cooperate in networks. The VC dimension of RBF networks has been studied earlier by Anthony and Holden [2], Lee et al., and Erlich et al. In particular, Erlich et al. established a linear lower bound, and Anthony and Holden [2] posed as an open problem whether a superlinear bound can be shown.

In this paper we prove that the VC dimension of RBF networks is indeed superlinear. Precisely, we show that every network with n input nodes, W parameters, and one hidden layer of k RBF neurons, where $k \le 2^{(n+2)/2}$, has VC dimension at least $(W/12)\log(k/8)$. Thus, the cooperative network effect enhancing the computational power of sigmoidal networks is now confirmed for RBF networks, too. Furthermore, the result has consequences for the complexity of learning with RBF networks, all the more since it entails the same lower bound for the pseudo dimension and the fat-shattering dimension. (See Anthony and Bartlett for implications of VC dimension bounds, and the relationship between the VC dimension, the pseudo dimension, and the fat-shattering dimension.)

We do not derive the result for RBF networks directly but take a major detour. We first consider networks consisting of a different type of locally processing units, the so-called binary center-surround receptive field (CSRF) neurons. These are discrete models of neurons found in the visual system of mammals (see, e.g., Nicholls et al., Chapter 16, and Tessier-Lavigne). In Section 3 we establish a superlinear VC dimension bound for CSRF neural networks showing that every network having W parameters and k hidden nodes has VC dimension at least $(W/5)\log(k/4)$, where $k \le 2^{n/2}$. Then in Section 4 we look at a continuous variant of the CSRF neuron known as the difference-of-Gaussians (DOG) neuron, which computes the weighted difference of two concentric Gaussians. This type of unit is widely used as a continuous model of neurons in the visual pathway (see, e.g., Glezer and Marr). Utilizing the result for CSRF networks we show that DOG networks have VC dimension at least $(W/5)\log(k/4)$ as well. Finally, the above claimed lower bound for RBF networks is then immediately obtained.

We note that regarding the constants, the bounds for CSRF and DOG networks are larger than for RBF networks. Further, all bounds we derive for networks of local receptive field neurons have larger constant factors than those known for sigmoidal networks of constant depth thus far. For comparison, sigmoidal networks are known that have one hidden layer and VC dimension at least $(W/32)\log(k/4)$; for two hidden layers a VC dimension of at least $(W/132)\log(k/16)$ has been found (see Anthony and Bartlett, Section 6.3).






Finally, the results obtained here give rise to linear lower bounds for local recep- 
tive field neural networks when the input dimension is fixed. 



2 Definitions and Notation 

Let $\|u\|$ denote the Euclidean norm of vector u. A Gaussian radial basis function (RBF) neuron computes a function $g_{RBF} : \mathbb{R}^{2n+1} \to \mathbb{R}$ defined as

$$g_{RBF}(c,\sigma,x) = \exp\left(-\frac{\|x-c\|^2}{\sigma^2}\right)$$

with input variables $x_1, \ldots, x_n$, and parameters $c_1, \ldots, c_n$ (the center) and $\sigma > 0$ (the width). A difference-of-Gaussians (DOG) neuron is defined as a function $g_{DOG} : \mathbb{R}^{2n+4} \to \mathbb{R}$ computed by the weighted difference of two RBF neurons with equal centers, that is,

$$g_{DOG}(c,\sigma,\tau,\alpha,\beta,x) = \alpha\, g_{RBF}(c,\sigma,x) - \beta\, g_{RBF}(c,\tau,x).$$

A binary center-surround receptive field (CSRF) neuron computes a function $g_{CSRF} : \mathbb{R}^{2n+2} \to \{0,1\}$ defined as

$$g_{CSRF}(c,a,b,x) = 1 \iff a \le \|x-c\| \le b$$

with center $(c_1, \ldots, c_n)$, center radius $a > 0$, and surround radius $b > a$. We also refer to it as off-center on-surround neuron and call for given parameters c, a, b the set $\{x : g_{CSRF}(c,a,b,x) = 1\}$ the surround region of the neuron.
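The three neuron types can be written down directly. The following is a small sketch using only the standard library; the helper `dist` is a hypothetical name, and the exact normalization of the Gaussian exponent (division by $\sigma^2$) is an assumption of this sketch.

```python
import math


def dist(x, c):
    """Euclidean distance ||x - c||."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def g_rbf(c, sigma, x):
    """Gaussian RBF neuron; the normalization exp(-||x-c||^2 / sigma^2) is assumed."""
    return math.exp(-dist(x, c) ** 2 / sigma ** 2)


def g_dog(c, sigma, tau, alpha, beta, x):
    """Weighted difference of two RBF neurons with equal centers."""
    return alpha * g_rbf(c, sigma, x) - beta * g_rbf(c, tau, x)


def g_csrf(c, a, b, x):
    """Binary CSRF neuron: 1 exactly on the surround region a <= ||x - c|| <= b."""
    return 1 if a <= dist(x, c) <= b else 0


center = (0.0, 0.0)
print(g_rbf(center, 1.0, center))              # value at the center
print(g_dog(center, 1.0, 2.0, 2.0, 1.0, center))
print([g_csrf(center, 1.0, 2.0, x) for x in [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]])
```

The CSRF neuron is off at its center, on inside the ring between the two radii, and off again beyond the surround radius.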

We consider neural networks with one hidden layer computing functions of the form $f : \mathbb{R}^{W+n} \to \mathbb{R}$, where W is the number of network parameters, n the number of input nodes, and

$$f(w,y,x) = w_0 + w_1 h_1(y,x) + \cdots + w_k h_k(y,x).$$

The k hidden nodes compute functions $h_1, \ldots, h_k \in \{g_{RBF}, g_{DOG}, g_{CSRF}\}$. (Each hidden node “selects” its parameters from y, which comprises all parameters of the hidden nodes.) Note that if $h_i = g_{RBF}$ for $i = 1, \ldots, k$, this is the standard form of a radial basis function neural network. The network has a linear output node with parameters $w_0, \ldots, w_k$, also known as the output weights. For simplicity we sometimes refer to all network parameters as weights.

An (n-1)-dimensional hyperplane in $\mathbb{R}^n$ is given by a vector $(w_0, \ldots, w_n) \in \mathbb{R}^{n+1}$ and defined as the set

$$\{x \in \mathbb{R}^n : w_0 + w_1x_1 + \cdots + w_nx_n = 0\}.$$

An (n-1)-dimensional hypersphere in $\mathbb{R}^n$ is represented by a center $c \in \mathbb{R}^n$ and a radius $r > 0$, and defined as the set

$$\{x \in \mathbb{R}^n : \|x-c\| = r\}.$$

We also consider hyperplanes and hyperspheres in $\mathbb{R}^n$ with a dimension $k < n-1$. In this case, a k-dimensional hyperplane (hypersphere) is the intersection of two (k+1)-dimensional hyperplanes (hyperspheres), assuming that the intersection is non-empty. (For hyperspheres we additionally require that the intersection is not a single point.)

A dichotomy of a set $S \subseteq \mathbb{R}^n$ is a pair $(S_0, S_1)$ of subsets such that $S_0 \cap S_1 = \emptyset$ and $S_0 \cup S_1 = S$. A class $\mathcal{F}$ of functions mapping $\mathbb{R}^n$ to {0, 1} shatters S if every dichotomy $(S_0, S_1)$ of S is induced by some $f \in \mathcal{F}$ (i.e., satisfying $f(S_0) \subseteq \{0\}$ and $f(S_1) \subseteq \{1\}$). The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ is the cardinality of the largest set shattered by $\mathcal{F}$. The VC dimension of a neural network $\mathcal{N}$ is defined as the VC dimension of the class of functions computed by $\mathcal{N}$, where the output is made binary using some threshold.
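For a finite function class on a finite domain, shattering and the VC dimension can be checked by enumeration. The sketch below is illustrative only: the class of closed intervals on five integer points is a toy example chosen here, not one from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def shatters(functions, points):
    """True if every dichotomy of `points` is induced by some function."""
    patterns = {tuple(f(p) for p in points) for f in functions}
    return len(patterns) == 2 ** len(points)


def vc_dimension(functions, domain):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    d = 0
    for size in range(1, len(domain) + 1):
        if not any(shatters(functions, s) for s in combinations(domain, size)):
            break  # subsets of shattered sets are shattered, so we can stop
        d = size
    return d


def interval(a, b):
    """Indicator function of the closed interval [a, b]."""
    return lambda x: int(a <= x <= b)


domain = [0, 1, 2, 3, 4]
intervals = [interval(a, b) for a in domain for b in domain if a <= b]
print(vc_dimension(intervals, domain))
```

Intervals shatter any two points but cannot label the outer two of three points 1 and the middle one 0, so the search stops at size three.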

We use “In” to denote the natural logarithm and “log” for the logarithm to 
base 2. 

3 Lower Bounds for CSRF Networks 

In this section we consider one-hidden-layer networks of binary center-surround 
receptive field neurons. The main result requires a property of certain finite sets 
of points. 

Definition. A set S of m points in $\mathbb{R}^n$ is said to be in spherically general position if the following two conditions are satisfied:

(1) For every $k \le \min(n, m-1)$ and every (k+1)-element subset $P \subseteq S$, there is no (k-1)-dimensional hyperplane containing all points in P.

(2) For every $l \le \min(n, m-2)$ and every (l+2)-element subset $Q \subseteq S$, there is no (l-1)-dimensional hypersphere containing all points in Q.

Sets satisfying condition (1) are commonly referred to as being “in general position” (see, e.g., Cover). For the VC dimension bounds we require sufficiently large sets in spherically general position. It is not hard to show that such sets exist.

Proposition 1. For every $n, m \ge 1$ there exists a set $S \subseteq \mathbb{R}^n$ of m points in spherically general position.

Proof. We perform induction on m. Clearly, every single point trivially satisfies conditions (1) and (2). Assume that some set $S \subseteq \mathbb{R}^n$ of cardinality m has been constructed. Then by the induction hypothesis, for every $k \le \min(n, m)$, every k-element subset $P \subseteq S$ does not lie on a hyperplane of dimension less than $k-1$. Hence, every $P \subseteq S$, $|P| = k \le \min(n, m)$, uniquely specifies a (k-1)-dimensional hyperplane $H_P$ that includes P. The induction hypothesis implies further that no point in $S \setminus P$ lies on $H_P$. Analogously, for every $l \le \min(n, m-1)$, every (l+1)-element subset $Q \subseteq S$ does not lie on a hypersphere of dimension less than $l-1$. Thus, every $Q \subseteq S$, $|Q| = l+1 \le \min(n, m-1)+1$, uniquely determines an (l-1)-dimensional hypersphere $B_Q$ containing all points in Q and none of the points in $S \setminus Q$.

To obtain a set of cardinality m + 1 in spherically general position we observe that the union of all hyperplanes and hyperspheres considered above, that is, the union of all $H_P$ and all $B_Q$ for all subsets P and Q, has Lebesgue measure 0. Hence, there is some point $s \in \mathbb{R}^n$ not contained in any hyperplane $H_P$ and not contained in any hypersphere $B_Q$. By adding s to S we then obtain a set of cardinality m + 1 in spherically general position. □
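The inductive step can be tried out in the plane. The sketch below assumes n = 2 and integer coordinates, in which case the two conditions amount to: all points distinct, no three on a common line, and no four on a common circle. This planar reading and all function names are assumptions of the sketch; determinants are computed exactly on integers.

```python
from itertools import combinations


def det(matrix):
    """Determinant by Laplace expansion along the first row (exact for integers)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for col, entry in enumerate(matrix[0]):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * entry * det(minor)
    return total


def collinear(p, q, r):
    return det([[p[0], p[1], 1], [q[0], q[1], 1], [r[0], r[1], 1]]) == 0


def concyclic_or_collinear(pts):
    """Four points lie on a common circle or line iff this determinant vanishes."""
    return det([[x, y, x * x + y * y, 1] for x, y in pts]) == 0


def spherically_general_2d(points):
    """Planar check: distinct points, no 3 on a line, no 4 on a circle."""
    if len(set(points)) != len(points):
        return False
    if any(collinear(*t) for t in combinations(points, 3)):
        return False
    return not any(concyclic_or_collinear(q) for q in combinations(points, 4))


def extend(points, candidates):
    """Inductive step: add the first candidate that keeps the set in position."""
    for s in candidates:
        if spherically_general_2d(points + [s]):
            return points + [s]
    return None


base = [(0, 0), (1, 0), (0, 1)]
print(extend(base, [(1, 1), (2, 3)]))
```

The candidate (1, 1) is rejected because it completes the unit square, whose corners lie on one circle; the next candidate is accepted.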

The following theorem establishes the major step for the superlinear lower 
bound. 

Theorem 2. Let $h, q, m \ge 1$ be arbitrary natural numbers. Suppose $\mathcal{N}$ is a network with one hidden layer consisting of binary CSRF neurons, where the number of hidden nodes is $h + 2^q$ and the number of input nodes is $m + q$. Assume further that the output node is linear. Then there exists a set of cardinality $hq(m+1)$ shattered by $\mathcal{N}$. This even holds if the output weights of $\mathcal{N}$ are fixed to 1.

Proof. Before starting with the details we give a brief outline. The main idea is 
to imagine the set we want to shatter as being composed of groups of vectors, 
where the groups are distinguished by means of the first m components and the 
remaining q components identify the group members. We catch these groups by 
hyperspheres such that each hypersphere is responsible for up to m + 1 groups.
The condition of spherically general position will ensure that this works. The 
hyperspheres are then expanded to become surround regions of CSRF neurons. 
To induce a dichotomy of the given set, we split the groups. We do this for each 
group using the q last components in such a way that the points with designated 
output 1 stay within the surround region of the respective neuron and the points 
with designated output 0 are expelled from it. In order for this to succeed, we 
have to make sure that the displaced points do not fall into the surround region 
of some other neuron. The verification of the split operation will constitute the 
major part of the proof. 

By means of Proposition 1 let $\{s_1, \ldots, s_{h(m+1)}\} \subseteq \mathbb{R}^m$ be in spherically general position. Let $e_1, \ldots, e_q$ denote the unit vectors in $\mathbb{R}^q$, that is, with a 1 in exactly one component and 0 elsewhere. We define the set S by

$$S = \{s_i : i = 1, \ldots, h(m+1)\} \times \{e_j : j = 1, \ldots, q\}.$$

Clearly, S is a subset of $\mathbb{R}^{m+q}$ and has cardinality $hq(m+1)$. To show that S is shattered by $\mathcal{N}$, let $(S_0, S_1)$ be some arbitrary dichotomy of S. Consider an enumeration $M_1, \ldots, M_{2^q}$ of all subsets of the set $\{1, \ldots, q\}$. Let the function $f : \{1, \ldots, h(m+1)\} \to \{1, \ldots, 2^q\}$ be defined by

$$M_{f(i)} = \{j : s_ie_j \in S_1\},$$

where $s_ie_j$ denotes the vector resulting from the concatenation of $s_i$ and $e_j$. We use f to define a partition of $\{s_1, \ldots, s_{h(m+1)}\}$ into sets $T_k$ for $k = 1, \ldots, 2^q$ by

$$T_k = \{s_i : f(i) = k\}.$$






We further partition each set $T_k$ into subsets $T_{k,p}$ for $p = 1, \ldots, \lceil |T_k|/(m+1) \rceil$, where each subset $T_{k,p}$ has cardinality m + 1, except if m + 1 does not divide $|T_k|$, in which case there is exactly one subset of cardinality less than m + 1. Since there are at most $h(m+1)$ elements $s_i$, the partitioning of all $T_k$ results in no more than h subsets of cardinality m + 1. Further, the fact $k \le 2^q$ permits at most $2^q$ subsets of cardinality less than m + 1. Thus, there are no more than $h + 2^q$ subsets $T_{k,p}$.
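The counting argument can be replayed on random dichotomies. This is a minimal sketch of the bookkeeping only (no geometry): a dichotomy is encoded as a 0/1 table `labels`, indices are 0-based, and the name `build_subsets` is hypothetical.

```python
import random
from collections import defaultdict


def build_subsets(labels, m):
    """labels[i][j] == 1 iff the point s_i e_j belongs to S_1.

    Returns a list of (M_k, T_kp) pairs, where M_k is a frozenset of indices j
    and T_kp is a list of at most m + 1 indices i with M_{f(i)} = M_k.
    """
    groups = defaultdict(list)                     # M_k -> indices i in T_k
    for i, row in enumerate(labels):
        groups[frozenset(j for j, bit in enumerate(row) if bit)].append(i)
    subsets = []
    for m_k, t_k in groups.items():
        for start in range(0, len(t_k), m + 1):    # chunks of size <= m + 1
            subsets.append((m_k, t_k[start:start + m + 1]))
    return subsets


h, q, m = 3, 2, 2
rng = random.Random(0)
labels = [[rng.randint(0, 1) for _ in range(q)] for _ in range(h * (m + 1))]
subsets = build_subsets(labels, m)
print(len(subsets), "<=", h + 2 ** q)
```

Each pair returned corresponds to one hidden node of the construction, so the number of pairs never exceeds the $h + 2^q$ hidden nodes available.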

We employ one hidden node $H_{k,p}$ for each subset $T_{k,p}$. Thus the $h + 2^q$ hidden nodes of $\mathcal{N}$ suffice. Since $\{s_1, \ldots, s_{h(m+1)}\}$ is in spherically general position, there exists for each $T_{k,p}$ an (m-1)-dimensional hypersphere containing all points in $T_{k,p}$ and no other point. If $|T_{k,p}| = m+1$, this hypersphere is unique; if $|T_{k,p}| < m+1$, there is a unique $(|T_{k,p}|-2)$-dimensional hypersphere which can be extended to an (m-1)-dimensional hypersphere that does not contain any further point. (Here we require condition (1) from the definition of spherically general position, otherwise no hypersphere of dimension $|T_{k,p}|-2$ including all points of $T_{k,p}$ might exist.) Clearly, if $|T_{k,p}| = 1$, we can also extend this single point to an (m-1)-dimensional hypersphere not including any further point.

Suppose that $(c_{k,p}, r_{k,p})$ with center $c_{k,p}$ and radius $r_{k,p}$ represents the hypersphere associated with subset $T_{k,p}$. It is obvious from the construction above that all radii satisfy $r_{k,p} > 0$. Further, since the subsets $T_{k,p}$ are pairwise disjoint, there is some $\varepsilon > 0$ such that every point $s_i \in \{s_1, \ldots, s_{h(m+1)}\}$ and every just defined hypersphere $(c_{k,p}, r_{k,p})$ satisfy

$$\text{if } s_i \notin T_{k,p} \text{ then } \bigl|\, \|s_i - c_{k,p}\| - r_{k,p} \bigr| > \varepsilon. \qquad (1)$$

In other words, $\varepsilon$ is smaller than the distance between any $s_i$ and any hypersphere $(c_{k,p}, r_{k,p})$ that does not contain $s_i$. Without loss of generality we assume that $\varepsilon$ is sufficiently small such that

$$\varepsilon < \min_{k,p} r_{k,p}. \qquad (2)$$

The parameters of the hidden nodes are adjusted as follows: We define the center $\hat{c}_{k,p} = (\hat{c}_{k,p,1}, \ldots, \hat{c}_{k,p,m+q})$ of hidden node $H_{k,p}$ by assigning the vector $c_{k,p}$ to the first m components and specifying the remaining ones by

$$\hat{c}_{k,p,m+j} = \begin{cases} 0 & \text{if } j \in M_k, \\ -\varepsilon^2/4 & \text{otherwise,} \end{cases}$$

for $j = 1, \ldots, q$. We further define new radii $\hat{r}_{k,p}$ by

$$\hat{r}_{k,p} = \sqrt{r_{k,p}^2 + (q - |M_k|)\left(\frac{\varepsilon}{2}\right)^4 + 1}$$

and choose some $\gamma > 0$ satisfying

$$\gamma < \min_{k,p} \frac{\varepsilon^2}{8\hat{r}_{k,p}}. \qquad (3)$$

The center and surround radii $\hat{a}_{k,p}, \hat{b}_{k,p}$ of the hidden nodes are then specified as

$$\hat{a}_{k,p} = \hat{r}_{k,p} - \gamma, \qquad \hat{b}_{k,p} = \hat{r}_{k,p} + \gamma.$$

Note that $\hat{a}_{k,p} > 0$ holds, because $\varepsilon^2 < \hat{r}_{k,p}^2$ implies $\gamma < \hat{r}_{k,p}$.

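This completes the assignment of parameters to the hidden nodes $H_{k,p}$. We now derive two inequalities concerning the relationship between $\varepsilon$ and $\gamma$ that we need in the following. First, we estimate $\varepsilon^2/2$ from below by

$$\frac{\varepsilon^2}{2} > \frac{\varepsilon^2}{4} + \frac{\varepsilon^2}{64} > \frac{\varepsilon^2}{4} + \frac{\varepsilon^4}{(8\hat{r}_{k,p})^2} \quad \text{for all } k, p,$$

where the last inequality is obtained from $\varepsilon^2 < \hat{r}_{k,p}^2$. Using (3) for both terms on the right-hand side, we get

$$\frac{\varepsilon^2}{2} > 2\hat{r}_{k,p}\gamma + \gamma^2 \quad \text{for all } k, p. \qquad (4)$$

Second, from (2) we get

$$-r_{k,p}\varepsilon + \frac{\varepsilon^2}{2} < -\frac{\varepsilon^2}{2} \quad \text{for all } k, p,$$

and (4) yields

$$-\frac{\varepsilon^2}{2} < -2\hat{r}_{k,p}\gamma \quad \text{for all } k, p.$$

Putting the last two inequalities together and adding $\gamma^2$ to the right-hand side, we obtain

$$-r_{k,p}\varepsilon + \frac{\varepsilon^2}{2} < -2\hat{r}_{k,p}\gamma + \gamma^2 \quad \text{for all } k, p. \qquad (5)$$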

We next establish three facts about the hidden nodes.

Claim 1. Let $s_ie_j$ be some point and $T_{k,p}$ some subset where $s_i \in T_{k,p}$ and $j \in M_k$. Then hidden node $H_{k,p}$ outputs 1 on $s_ie_j$.

According to the definition of $\hat{c}_{k,p}$, if $j \in M_k$, we have

$$\|s_ie_j - \hat{c}_{k,p}\|^2 = \|s_i - c_{k,p}\|^2 + (q - |M_k|)\left(\frac{\varepsilon}{2}\right)^4 + 1.$$

The condition $s_i \in T_{k,p}$ implies $\|s_i - c_{k,p}\|^2 = r_{k,p}^2$, and thus

$$\|s_ie_j - \hat{c}_{k,p}\|^2 = r_{k,p}^2 + (q - |M_k|)\left(\frac{\varepsilon}{2}\right)^4 + 1 = \hat{r}_{k,p}^2.$$

It follows that $\|s_ie_j - \hat{c}_{k,p}\| = \hat{r}_{k,p}$, and since $\hat{a}_{k,p} < \hat{r}_{k,p} < \hat{b}_{k,p}$, point $s_ie_j$ lies within the surround region of node $H_{k,p}$. Hence, Claim 1 is shown.

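Claim 2. Let $s_ie_j$ and $T_{k,p}$ satisfy $s_i \in T_{k,p}$ and $j \notin M_k$. Then hidden node $H_{k,p}$ outputs 0 on $s_ie_j$.

From the assumptions we get here

$$\|s_ie_j - \hat{c}_{k,p}\|^2 = \|s_i - c_{k,p}\|^2 + (q - |M_k| - 1)\left(\frac{\varepsilon}{2}\right)^4 + \left(1 + \frac{\varepsilon^2}{4}\right)^2 = r_{k,p}^2 + (q - |M_k|)\left(\frac{\varepsilon}{2}\right)^4 + 1 + \frac{\varepsilon^2}{2} = \hat{r}_{k,p}^2 + \frac{\varepsilon^2}{2}.$$

Employing (4) on the right-hand side results in

$$\|s_ie_j - \hat{c}_{k,p}\|^2 > \hat{r}_{k,p}^2 + 2\hat{r}_{k,p}\gamma + \gamma^2.$$

Hence, taking square roots we have $\|s_ie_j - \hat{c}_{k,p}\| > \hat{r}_{k,p} + \gamma$, implying that $s_ie_j$ lies outside the surround region of $H_{k,p}$. Thus, Claim 2 follows.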
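Claims 1 and 2 can be checked numerically for a single hidden node. The sketch below uses a hypothetical toy instance (m = 2, q = 3, one sphere of radius 1 around the origin, 0-based indices j) and picks $\gamma$ as half of the bound $\varepsilon^2/(8\hat{r})$ for this one node; these concrete values are assumptions made for illustration.

```python
import math


def csrf(center, a, b, x):
    """Binary CSRF neuron: 1 iff a <= ||x - center|| <= b."""
    d = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    return 1 if a <= d <= b else 0


def hidden_node(c, r, m_k, q, eps):
    """Parameters (c_hat, a_hat, b_hat) of one hidden node, following the proof."""
    c_hat = list(c) + [0.0 if j in m_k else -eps ** 2 / 4 for j in range(q)]
    r_hat = math.sqrt(r ** 2 + (q - len(m_k)) * (eps / 2) ** 4 + 1)
    gamma = eps ** 2 / (16 * r_hat)      # any 0 < gamma < eps^2 / (8 r_hat) works
    return c_hat, r_hat - gamma, r_hat + gamma


def point(s, j, q):
    """Concatenation s e_j of s with the j-th unit vector of R^q (0-based j)."""
    return list(s) + [1.0 if t == j else 0.0 for t in range(q)]


# Hypothetical toy instance: m = 2, q = 3, one sphere with center (0, 0), radius 1.
c, r, q, eps = (0.0, 0.0), 1.0, 3, 0.5
m_k = {0, 2}                 # indices j whose points s e_j should get output 1
s = (1.0, 0.0)               # a point on the sphere, i.e. s in T_{k,p}
c_hat, a_hat, b_hat = hidden_node(c, r, m_k, q, eps)
outputs = [csrf(c_hat, a_hat, b_hat, point(s, j, q)) for j in range(q)]
print(outputs)
```

The node fires exactly on the indices in `m_k`: for those, the distance to the shifted center equals $\hat{r}$, while for the remaining index the squared distance grows by $\varepsilon^2/2$ and the point leaves the surround region.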

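Claim 3. Let $s_ie_j$ be some point and $T_{k,p}$ some subset such that $s_i \in T_{k,p}$. Then every hidden node $H_{k',p'}$ with $(k',p') \ne (k,p)$ outputs 0 on $s_ie_j$.

Since $s_i \in T_{k,p}$ and $s_i$ is not contained in any other subset $T_{k',p'}$, condition (1) implies

$$\|s_i - c_{k',p'}\|^2 > (r_{k',p'} + \varepsilon)^2 \quad \text{or} \quad \|s_i - c_{k',p'}\|^2 < (r_{k',p'} - \varepsilon)^2. \qquad (6)$$

We distinguish between two cases: whether $j \in M_{k'}$ or not.

Case 1. If $j \in M_{k'}$ then by the definition of $\hat{c}_{k',p'}$ we have

$$\|s_ie_j - \hat{c}_{k',p'}\|^2 = \|s_i - c_{k',p'}\|^2 + (q - |M_{k'}|)\left(\frac{\varepsilon}{2}\right)^4 + 1.$$

From this, using (6) and the definition of $\hat{r}_{k',p'}$ we obtain

$$\|s_ie_j - \hat{c}_{k',p'}\|^2 > \hat{r}_{k',p'}^2 + 2r_{k',p'}\varepsilon + \varepsilon^2 \quad \text{or} \quad \|s_ie_j - \hat{c}_{k',p'}\|^2 < \hat{r}_{k',p'}^2 - 2r_{k',p'}\varepsilon + \varepsilon^2. \qquad (7)$$

We derive bounds for the right-hand sides of these inequalities as follows. From (4) we have

$$\varepsilon^2 > 4\hat{r}_{k',p'}\gamma + 2\gamma^2$$

which, after adding $2r_{k',p'}\varepsilon$ to the left-hand side and halving the right-hand side, gives

$$2r_{k',p'}\varepsilon + \varepsilon^2 > 2\hat{r}_{k',p'}\gamma + \gamma^2. \qquad (8)$$

From (2) we get $\varepsilon^2/2 < r_{k',p'}\varepsilon$, that is, the left-hand side of (5) is negative. Hence, we may double it to obtain from (5)

$$-2r_{k',p'}\varepsilon + \varepsilon^2 < -2\hat{r}_{k',p'}\gamma + \gamma^2.$$

Using this and (8) in (7) leads to

$$\|s_ie_j - \hat{c}_{k',p'}\|^2 > (\hat{r}_{k',p'} + \gamma)^2 \quad \text{or} \quad \|s_ie_j - \hat{c}_{k',p'}\|^2 < (\hat{r}_{k',p'} - \gamma)^2.$$

And this is equivalent to

$$\|s_ie_j - \hat{c}_{k',p'}\| > \hat{b}_{k',p'} \quad \text{or} \quad \|s_ie_j - \hat{c}_{k',p'}\| < \hat{a}_{k',p'},$$

meaning that $H_{k',p'}$ outputs 0.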

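Case 2. If $j \notin M_{k'}$ then

$$\|s_i e_j - \tilde c_{k',p'}\|^2 = \|s_i - c_{k',p'}\|^2 + (q - |M_{k'}|)\left(\frac{\varepsilon}{2}\right)^4 + 1 + \frac{\varepsilon^2}{2} .$$

As a consequence of this together with (6) and the definition of $\tilde r_{k',p'}$ we get

$$\|s_i e_j - \tilde c_{k',p'}\|^2 > \tilde r_{k',p'}^2 + 2 r_{k',p'}\varepsilon + \varepsilon^2 + \frac{\varepsilon^2}{2}
\quad\text{or}\quad
\|s_i e_j - \tilde c_{k',p'}\|^2 < \tilde r_{k',p'}^2 - 2 r_{k',p'}\varepsilon + \varepsilon^2 + \frac{\varepsilon^2}{2} ,$$

from which we derive, using for the second inequality the earlier bound $\varepsilon < r_{k',p'}$,

$$\|s_i e_j - \tilde c_{k',p'}\|^2 > \tilde r_{k',p'}^2 + r_{k',p'}\varepsilon + \frac{\varepsilon^2}{2}
\quad\text{or}\quad
\|s_i e_j - \tilde c_{k',p'}\|^2 < \tilde r_{k',p'}^2 - r_{k',p'}\varepsilon + \frac{\varepsilon^2}{2} . \tag{9}$$

Finally, from the earlier condition relating $\varepsilon$ and $\gamma$ we have

$$r_{k',p'}\varepsilon + \frac{\varepsilon^2}{2} > 2\tilde r_{k',p'}\gamma + \gamma^2 ,$$

and, employing this together with the earlier inequality
$-r_{k',p'}\varepsilon + \varepsilon^2/2 < -2\tilde r_{k',p'}\gamma + \gamma^2$, we obtain from (9)

$$\|s_i e_j - \tilde c_{k',p'}\|^2 > (\tilde r_{k',p'} + \gamma)^2 \quad\text{or}\quad \|s_i e_j - \tilde c_{k',p'}\|^2 < (\tilde r_{k',p'} - \gamma)^2 ,$$

which holds if and only if

$$\|s_i e_j - \tilde c_{k',p'}\| > b_{k',p'} \quad\text{or}\quad \|s_i e_j - \tilde c_{k',p'}\| < a_{k',p'} .$$

This shows that $H_{k',p'}$ outputs 0 also in this case. Thus, Claim 3 is established.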

We complete the construction of $\mathcal{N}$ by connecting every hidden node with
weight 1 to the output node, which then computes the sum of the hidden node
output values.

We finally show that we have indeed obtained a network that induces the
dichotomy $(S_0, S_1)$. Assume that $s_i e_j \in S_1$. Claims 1 and 3 imply that there
is exactly one hidden node $H_{k,p}$, namely one satisfying $k = f(i)$ by the definition
of $f$, that outputs 1 on $s_i e_j$. Hence, the network outputs 1 as well. On the other
hand, if $s_i e_j \in S_0$, it follows from Claims 2 and 3 that none of the hidden nodes
outputs 1. Therefore, the network output is 0. Thus, $\mathcal{N}$ shatters $S$ with threshold
1/2 and the theorem is proven. □
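
The way such a network evaluates a point can be illustrated with a small numerical sketch. The code below assumes that a binary CSRF neuron outputs 1 exactly when the distance between input and center lies between the center radius $a$ and the surround radius $b$ (the treatment of the boundary is an assumption), and that the output node sums the hidden outputs with all weights fixed to 1 before thresholding at 1/2. The names `csrf_output` and `network_output` and the toy parameters are hypothetical; they do not reproduce the sets $T_{k,p}$ of the proof.

```python
import math

def csrf_output(center, a, b, x):
    """Binary center-surround neuron: 1 iff a <= ||x - center|| <= b (assumed convention)."""
    dist = math.dist(x, center)
    return 1 if a <= dist <= b else 0

def network_output(hidden, x, threshold=0.5):
    """Sum the hidden outputs with all output weights fixed to 1, then threshold."""
    total = sum(csrf_output(c, a, b, x) for (c, a, b) in hidden)
    return 1 if total > threshold else 0

# Toy example: two neurons whose surround regions are thin shells of width 2*gamma.
gamma = 0.01
hidden = [
    ((0.0, 0.0), 1.0 - gamma, 1.0 + gamma),
    ((3.0, 0.0), 1.0 - gamma, 1.0 + gamma),
]

on_shell = network_output(hidden, (1.0, 0.0))   # lies on the first shell only
off_shell = network_output(hidden, (0.5, 0.0))  # lies inside the first center region
print(on_shell, off_shell)
```

A point that lies in the surround region of exactly one hidden node yields network output 1, and a point that lies in no surround region yields 0, which mirrors the case distinction at the end of the proof.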

The construction in the previous proof was based on the assumption that
the difference between center radius and surround radius, given by the value $2\gamma$,
can be made sufficiently small. This may require constraints for the precision of
computation that are not available in natural or artificial systems. It is possible,
however, to obtain the same result even if there is a lower bound on $\gamma$ by simply
scaling the elements of the shattered set using a sufficiently large factor.

In the following we obtain a superlinear lower bound for the VC dimension
of networks with center-surround receptive field neurons. By $\lfloor x \rfloor$ we denote the
largest integer less than or equal to $x$.

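Corollary 3. Suppose $\mathcal{N}$ is a network with one hidden layer of $k$ binary CSRF
neurons and input dimension $n \ge 2$, where $k \le 2^n$, and assume that the output
node is linear. Then $\mathcal{N}$ has VC dimension at least

$$\left\lfloor \frac{k}{2} \right\rfloor \cdot \left\lfloor \log\frac{k}{2} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \right) .$$

This even holds if the weights of the output node are not adjustable.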

Proof. We use Theorem 2 with $h = \lfloor k/2 \rfloor$, $q = \lfloor \log(k/2) \rfloor$, and $m = n - \lfloor \log(k/2) \rfloor$.
The condition $k \le 2^n$ guarantees that $m \ge 1$. Then there is a set of
cardinality

$$hq(m+1) = \left\lfloor \frac{k}{2} \right\rfloor \cdot \left\lfloor \log\frac{k}{2} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \right)$$

that is shattered by the network specified in Theorem 2. Since the number of
hidden nodes is $h + 2^q \le k$ and the input dimension is $m + q = n$, the network
satisfies the required conditions. Furthermore, it was shown in the proof of
Theorem 2 that all weights of the output node can be fixed to 1. Hence, they need
not be adjustable. □
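
The parameter choice in this proof can be checked numerically. The following sketch assumes that $\log$ denotes the base-2 logarithm (suggested by the term $2^q$, but an assumption nonetheless); the function names are hypothetical. It computes $h$, $q$, $m$ with integer arithmetic and evaluates the cardinality $hq(m+1)$.

```python
def corollary3_parameters(k, n):
    """Return (h, q, m) as chosen in the proof, for integers 2 <= k <= 2**n."""
    h = k // 2                  # floor(k/2)
    q = k.bit_length() - 2      # floor(log2(k/2)) for every integer k >= 2
    m = n - q
    return h, q, m

def corollary3_bound(k, n):
    """Cardinality h*q*(m+1) of the shattered set."""
    h, q, m = corollary3_parameters(k, n)
    return h * q * (m + 1)

# Check the two conditions used in the proof for all admissible k up to n = 10.
for n in range(2, 11):
    for k in range(2, 2 ** n + 1):
        h, q, m = corollary3_parameters(k, n)
        assert h + 2 ** q <= k   # enough hidden nodes
        assert m >= 1 and m + q == n

print(corollary3_bound(8, 6))
```

For $k = 8$ and $n = 6$ the choice is $h = 4$, $q = 2$, $m = 4$, so the shattered set has $4 \cdot 2 \cdot 5 = 40$ elements.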






Corollary 3 immediately implies the following statement, which gives a
superlinear lower bound in terms of the number of weights and the number of
hidden nodes.

Corollary 4. Consider a network $\mathcal{N}$ with input dimension $n \ge 2$, one hidden
layer of $k$ binary CSRF neurons, where $k \le 2^{n/2}$, and a linear output node. Let
$W = k(n + 2) + k + 1$ denote the number of weights. Then $\mathcal{N}$ has VC dimension
at least

$$\frac{W}{5} \log\frac{k}{4} .$$

This even holds if the weights of the output node are fixed.

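Proof. According to Corollary 3, $\mathcal{N}$ has VC dimension at least
$\lfloor k/2 \rfloor \cdot \lfloor \log(k/2) \rfloor \cdot (n - \lfloor \log(k/2) \rfloor + 1)$. The condition $k \le 2^{n/2}$ implies

$$n - \left\lfloor \log\frac{k}{2} \right\rfloor + 1 \ge \frac{n+4}{2} .$$

We may assume that $k \ge 5$. (The statement is trivial for $k \le 4$.) It follows, using
$\lfloor k/2 \rfloor \ge (k-1)/2$ and $k/10 \ge 1/2$, that

$$\left\lfloor \frac{k}{2} \right\rfloor \ge \frac{2k}{5} .$$

Finally, we have

$$\left\lfloor \log\frac{k}{2} \right\rfloor \ge \log\frac{k}{2} - 1 = \log\frac{k}{4} .$$

Hence, $\mathcal{N}$ has VC dimension at least $(n + 4)(k/5)\log(k/4)$, which is at least as
large as the claimed bound $(W/5)\log(k/4)$. □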
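
As a sanity check of the chain of estimates, the sketch below compares the bound of Corollary 3 with $(W/5)\log(k/4)$ over a range of small networks. It assumes base-2 logarithms and uses hypothetical function names; it is a numerical illustration, not a substitute for the proof.

```python
import math

def corollary3_bound(k, n):
    """floor(k/2) * floor(log2(k/2)) * (n - floor(log2(k/2)) + 1) for k >= 2."""
    q = k.bit_length() - 2      # floor(log2(k/2)) via integer arithmetic
    return (k // 2) * q * (n - q + 1)

def corollary4_bound(k, n):
    """(W/5) * log2(k/4) with W = k(n+2) + k + 1 as in the statement."""
    W = k * (n + 2) + k + 1
    return (W / 5) * math.log2(k / 4)

def check_corollary4(max_n=20):
    """Verify bound3 >= bound4 for all 5 <= k <= 2**(n/2), 2 <= n <= max_n."""
    for n in range(2, max_n + 1):
        k_max = math.floor(2 ** (n / 2))
        for k in range(5, k_max + 1):
            if corollary3_bound(k, n) < corollary4_bound(k, n):
                return False
    return True

print(check_corollary4())
```

For example, $n = 6$ and $k = 8$ give $40$ for the bound of Corollary 3 and $(73/5)\cdot 1 = 14.6$ for the bound of Corollary 4.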

In the networks considered thus far the input dimension was assumed to be
variable. It is an easy consequence of Theorem 2 that even when $n$ is constant,
the VC dimension still grows linearly in terms of the network size.

Corollary 5. Assume that the input dimension is fixed and consider a network
$\mathcal{N}$ with one hidden layer of binary CSRF neurons and a linear output node. Then
the VC dimension of $\mathcal{N}$ is $\Omega(k)$ and $\Omega(W)$, where $k$ is the number of hidden
nodes and $W$ the number of weights. This even holds in the case of fixed output
weights.

Proof. Choose $m, q \ge 1$ such that $m + q \le n$, and let $h = k - 2^q$. Since $n$
is constant, $hq(m + 1)$ is $\Omega(k)$. Thus, according to Theorem 2 there is a set
of cardinality $\Omega(k)$ shattered by $\mathcal{N}$. Since the number of weights is $O(k)$, the
bound $\Omega(W)$ also follows. □






4 Lower Bounds for RBF and DOG Networks 

In the following we present the lower bounds for networks with one hidden layer 
of Gaussian radial basis function neurons and difference-of-Gaussians neurons, 
respectively. We first consider the latter type. 

Theorem 6. Let $h, q, m \ge 1$ be arbitrary natural numbers. Suppose $\mathcal{N}$ is a
network with $m + q$ input nodes, one hidden layer of $h + 2^q$ DOG neurons, and
a linear output node. Then there is a set of cardinality $hq(m + 1)$ shattered by
$\mathcal{N}$.

Proof. We use ideas and results from the proof of Theorem 2. In particular, we
show that the set constructed there can be shattered by a network of new model
neurons, the so-called extended Gaussian neurons, which we introduce below.
Then we demonstrate that a network of these extended Gaussian neurons can
be simulated by a network of DOG neurons, which establishes the statement of
the theorem.

We define an extended Gaussian neuron with $n$ inputs to compute the
function $g : \mathbb{R}^{2n+2} \to \mathbb{R}$ with

$$g(c, \sigma, \alpha, x) = 1 - \left( \alpha \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) - 1 \right)^2 ,$$

where $x_1, \ldots, x_n$ are the input variables, and $c_1, \ldots, c_n$, $\alpha$, and $\sigma > 0$ are real-valued
parameters. Thus, the computation of an extended Gaussian neuron is performed
by scaling the output of a Gaussian RBF neuron with $\alpha$, squaring the difference
to 1, and comparing this value with 1.
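
The three steps (scale, square the difference to 1, compare with 1) can be sketched in a few lines. The code assumes that a Gaussian RBF neuron computes $\exp(-\|x - c\|^2/\sigma^2)$; the function name and the numerical values are hypothetical and serve only as an illustration.

```python
import math

def extended_gaussian(c, sigma, alpha, x):
    """1 - (alpha * exp(-||x - c||^2 / sigma^2) - 1)^2."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    rbf = math.exp(-sq_dist / sigma ** 2)   # assumed Gaussian RBF output
    return 1.0 - (alpha * rbf - 1.0) ** 2

# Illustrative choice: alpha is set so that the scaled RBF output equals 1
# at distance r from the center, where the neuron then attains its maximum 1.
r, sigma = 2.0, 0.5
alpha = math.exp(r ** 2 / sigma ** 2)
center = (0.0, 0.0)

at_radius = extended_gaussian(center, sigma, alpha, (r, 0.0))
beyond_radius = extended_gaussian(center, sigma, alpha, (r + 0.5, 0.0))
print(at_radius, beyond_radius)
```

With these values the output is 1 (up to rounding) at distance $r$ and close to 0 at distance $r + 0.5$, since the scaled RBF output there is $e^{-9}$.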

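Let $S \subseteq \mathbb{R}^{m+q}$ be the set of cardinality $hq(m + 1)$ constructed in the proof
of Theorem 2. In particular, $S$ has the form

$$S = \{ s_i e_j : i = 1, \ldots, h(m + 1);\ j = 1, \ldots, q \} .$$

We have also defined in that proof binary CSRF neurons $H_{k,p}$ as hidden nodes
in terms of parameters $\tilde c_{k,p} \in \mathbb{R}^{m+q}$, which became the centers of the neurons,
and $\tilde r_{k,p} \in \mathbb{R}$, which gave the center radii $a_{k,p} = \tilde r_{k,p} - \gamma$ and the surround radii
$b_{k,p} = \tilde r_{k,p} + \gamma$ using some $\gamma > 0$. The number of hidden nodes was not larger
than $h + 2^q$. We replace the CSRF neurons by extended Gaussian neurons $G_{k,p}$
with parameters $\hat c_{k,p}, \sigma_{k,p}, \alpha_{k,p}$ defined as follows. Assume some $\sigma > 0$ that will
be specified later. We let

$$\hat c_{k,p} = \tilde c_{k,p} , \qquad \sigma_{k,p} = \sigma , \qquad \alpha_{k,p} = \exp\left( \frac{\tilde r_{k,p}^2}{\sigma^2} \right) .$$

These hidden nodes are connected to the output node with all weights being 1.
We call this network $\mathcal{N}'$ and claim that it shatters $S$. Consider some arbitrary
dichotomy $(S_0, S_1)$ of $S$ and some $s_i e_j \in S$. Then node $G_{k,p}$ computes

$$\begin{aligned}
g(\hat c_{k,p}, \sigma_{k,p}, \alpha_{k,p}, s_i e_j)
&= 1 - \left( \alpha_{k,p} \exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2}{\sigma_{k,p}^2} \right) - 1 \right)^2 \\
&= 1 - \left( \exp\left( \frac{\tilde r_{k,p}^2}{\sigma^2} \right) \cdot \exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2}{\sigma^2} \right) - 1 \right)^2 \\
&= 1 - \left( \exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2 - \tilde r_{k,p}^2}{\sigma^2} \right) - 1 \right)^2 .
\end{aligned} \tag{10}$$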



Suppose first that $s_i e_j \in S_1$. It was shown by Claims 1 and 3 in the proof of
Theorem 2 that there is exactly one hidden node $H_{k,p}$ that outputs 1 on $s_i e_j$.
In particular, the proof of Claim 1 established that this node satisfies

$$\|s_i e_j - \tilde c_{k,p}\| = \tilde r_{k,p} .$$

Hence, according to (10) node $G_{k,p}$ outputs 1. We note that this holds for all
values of $\sigma$. Further, the proofs of Claims 2 and 3 yielded that those nodes $H_{k,p}$
that output 0 on $s_i e_j$ satisfy

$$\|s_i e_j - \hat c_{k,p}\|^2 > (\tilde r_{k,p} + \gamma)^2 \quad\text{or}\quad \|s_i e_j - \hat c_{k,p}\|^2 < (\tilde r_{k,p} - \gamma)^2 .$$

This implies for the computation of $G_{k,p}$ that in (10) we can make the expression

$$\exp\left( -\frac{\|s_i e_j - \hat c_{k,p}\|^2 - \tilde r_{k,p}^2}{\sigma^2} \right)$$

as close to 0 as necessary by choosing $\sigma$ sufficiently small. Since this does not
affect the node that outputs 1, network $\mathcal{N}'$ computes a value close to 1 on $s_i e_j$.

On the other hand, for the case $s_i e_j \in S_0$ it was shown in Theorem 2 that all
nodes $H_{k,p}$ output 0. Thus, if $\sigma$ is sufficiently small, each node $G_{k,p}$, and hence
$\mathcal{N}'$, outputs a value close to 0. Hence, $S$ is shattered by thresholding the output
of $\mathcal{N}'$ at 1/2.

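Finally, we show that $S$ can be shattered by a network $\mathcal{N}$ of the same size
with DOG neurons as hidden nodes. The computation of an extended Gaussian
neuron can be rewritten as

$$\begin{aligned}
g(c, \sigma, \alpha, x)
&= 1 - \left( \alpha \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) - 1 \right)^2 \\
&= 1 - \left( \alpha^2 \exp\left( -\frac{2\|x - c\|^2}{\sigma^2} \right) - 2\alpha \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) + 1 \right) \\
&= 2\alpha \exp\left( -\frac{\|x - c\|^2}{\sigma^2} \right) - \alpha^2 \exp\left( -\frac{\|x - c\|^2}{(\sigma/\sqrt{2})^2} \right) \\
&= g_{\mathrm{DOG}}(c, \sigma, \sigma/\sqrt{2}, 2\alpha, \alpha^2, x) .
\end{aligned}$$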






Hence, the extended Gaussian neuron is equivalent to a weighted difference of two
Gaussian neurons with center $c$, widths $\sigma$, $\sigma/\sqrt{2}$ and weights $2\alpha$, $\alpha^2$, respectively.
Thus, the extended Gaussian neurons can be replaced by the same number of
DOG neurons. □
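
The rewriting can be verified numerically. The sketch below assumes that a Gaussian RBF neuron computes $\exp(-\|x - c\|^2/\sigma^2)$ and that a DOG neuron with parameters $(c, \sigma, \tau, \alpha, \beta)$ computes the weighted difference $\alpha\, g_{\mathrm{RBF}}(c, \sigma, x) - \beta\, g_{\mathrm{RBF}}(c, \tau, x)$, which is consistent with the identity above; all function names are hypothetical.

```python
import math
import random

def gaussian_rbf(c, sigma, x):
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq_dist / sigma ** 2)

def extended_gaussian(c, sigma, alpha, x):
    return 1.0 - (alpha * gaussian_rbf(c, sigma, x) - 1.0) ** 2

def dog(c, sigma, tau, alpha, beta, x):
    """Weighted difference of two Gaussians sharing the center c (assumed form)."""
    return alpha * gaussian_rbf(c, sigma, x) - beta * gaussian_rbf(c, tau, x)

def max_identity_gap(trials=1000, seed=0):
    """Largest deviation between both sides of the identity on random inputs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        c = [rng.uniform(-1, 1) for _ in range(3)]
        x = [rng.uniform(-1, 1) for _ in range(3)]
        sigma = rng.uniform(0.5, 2.0)
        alpha = rng.uniform(0.1, 3.0)
        lhs = extended_gaussian(c, sigma, alpha, x)
        rhs = dog(c, sigma, sigma / math.sqrt(2), 2 * alpha, alpha ** 2, x)
        worst = max(worst, abs(lhs - rhs))
    return worst

print(max_identity_gap())
```

The reported gap is at the level of floating-point rounding, as expected from expanding the square and using $\exp(-d^2/(\sigma/\sqrt{2})^2) = \exp(-2d^2/\sigma^2)$.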



We note that the network of extended Gaussian neurons constructed in the
previous proof has all output weights fixed, whereas the output weights of the
DOG neurons, that is, the parameters $\alpha$ and $\beta$ in the notation of Section 2, are
calculated from the parameters of the extended Gaussian neurons and, therefore,
depend on the particular dichotomy to be implemented. (It is trivial for a DOG
network to have an output node with fixed weights since the DOG neurons have
built-in output weights.)

We are now able to deduce a superlinear lower bound on the VC dimension 
of DOG networks. 



Corollary 7. Suppose $\mathcal{N}$ is a network with one hidden layer of DOG neurons
and a linear output node. Let $\mathcal{N}$ have $k$ hidden nodes and input dimension $n \ge 2$,
where $k \le 2^n$. Then $\mathcal{N}$ has VC dimension at least




Let $W$ denote the number of weights and assume that $k \le 2^{n/2}$. Then the VC
dimension of $\mathcal{N}$ is at least




For fixed input dimension the VC dimension of $\mathcal{N}$ is bounded below by $\Omega(k)$ and $\Omega(W)$.

Proof. The results are implied by Theorem 6 in the same way as Corollaries 3,
4, and 5 follow from Theorem 2. □

Finally, we have the lower bound for Gaussian RBF networks. 

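Theorem 8. Suppose $\mathcal{N}$ is a network with one hidden layer of Gaussian RBF
neurons and a linear output node. Let $k$ be the number of hidden nodes and $n$
the input dimension, where $n \ge 2$ and $k \le 2^{n+1}$. Then $\mathcal{N}$ has VC dimension at
least

$$\left\lfloor \frac{k}{4} \right\rfloor \cdot \left\lfloor \log\frac{k}{4} \right\rfloor \cdot \left( n - \left\lfloor \log\frac{k}{4} \right\rfloor + 1 \right) .$$

Let $W$ denote the number of weights and assume that $k \le 2^{(n+2)/2}$. Then the
VC dimension of $\mathcal{N}$ is at least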




For fixed input dimension $n \ge 2$ the VC dimension of $\mathcal{N}$ satisfies the bounds
$\Omega(k)$ and $\Omega(W)$.






Proof. Clearly, a DOG neuron can be simulated by two weighted Gaussian RBF
neurons. Thus, by virtue of Theorem 6 there is a network $\mathcal{N}$ with $m + q$ input
nodes and one hidden layer of $2(h + 2^q)$ Gaussian RBF neurons that shatters
some set of cardinality $hq(m + 1)$. Choosing $h = \lfloor k/4 \rfloor$, $q = \lfloor \log(k/4) \rfloor$, and
$m = n - \lfloor \log(k/4) \rfloor$ we obtain similarly to Corollary 3 the claimed lower bound
in terms of $n$ and $k$. Furthermore, the stated bound in terms of $W$ and $k$ follows
by analogy to Corollary 4. The bound for fixed input dimension is obvious, as
in the proof of Corollary 5. □

Some radial basis function networks studied theoretically or used in practice 
have no adjustable width parameters (for instance in [SEES)- Therefore, a natural 
question is whether the previous result also holds for networks with fixed width 
parameters. The values of the width parameters for Theorem 8 arise from the
widths of DOG neurons specified in Theorem 6. The two width parameters of
each DOG neuron have the form $\sigma$ and $\sigma/\sqrt{2}$ where $\sigma$ is common to all DOG
neurons and is only required to be sufficiently small. Hence, we can choose a
single $\sigma$ that is sufficiently small for all dichotomies to be induced. Thus, for
the RBF network we not only have that the width parameters can be fixed, but
even that there need to be only two different width values, solely depending on
the architecture and not on the particular dichotomy.

Corollary 9. Let $\mathcal{N}$ be a Gaussian RBF network with $n$ input nodes and $k$
hidden nodes satisfying the conditions of Theorem 8. Then there exists a real
number $\sigma_{k,n} > 0$ such that the VC dimension bounds stated in Theorem 8 hold
for $\mathcal{N}$ with each RBF neuron having fixed width $\sigma_{k,n}$ or $\sigma_{k,n}/\sqrt{2}$.

With regard to Theorem 8 we further remark that $k$ has been previously
established as a lower bound for RBF networks by Anthony and Holden [2]. Further,
Theorem 19 of Lee et al. [11] in connection with the result of Erlich et al.
[7] also implies the lower bound $\Omega(nk)$, and hence $\Omega(k)$ for fixed input dimension.

By means of Theorem 8 we are now able to present a lower bound that is even
superlinear in $k$.

Corollary 10. Let $n \ge 2$ and $\mathcal{N}$ be the network with $k = 2^{n+1}$ hidden Gaussian
RBF neurons. Then $\mathcal{N}$ has VC dimension at least




Proof. Since $k = 2^{n+1}$, we may substitute $n = \log k - 1$ in the first bound of
Theorem 8. Hence, the VC dimension of $\mathcal{N}$ is at least




Using $\lfloor k/4 \rfloor \ge k/6$ and $\lfloor \log(k/4) \rfloor \ge \log(k/8)$ yields the claimed bound. □
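
The superlinear growth for $k = 2^{n+1}$ can be made concrete with the parameter choice from the proof of Theorem 8, namely $h = \lfloor k/4 \rfloor$, $q = \lfloor \log(k/4) \rfloor$, $m = n - q$, and a shattered set of cardinality $hq(m+1)$. The sketch assumes base-2 logarithms and uses a hypothetical function name; it evaluates this cardinality only and does not restate the bound of the corollary.

```python
def rbf_shattered_set_size(n):
    """Return (k, h*q*(m+1)) for k = 2**(n+1) Gaussian RBF hidden nodes, n >= 2."""
    k = 2 ** (n + 1)
    h = k // 4                        # floor(k/4)
    q = (k // 4).bit_length() - 1     # floor(log2(k/4)); exact, k/4 is a power of 2
    m = n - q
    # Two RBF neurons per DOG neuron must fit into the k hidden nodes.
    assert 2 * (h + 2 ** q) <= k and m + q == n
    return k, h * q * (m + 1)

for n in range(2, 8):
    k, size = rbf_shattered_set_size(n)
    print(n, k, size, size / k)
```

Under these assumptions $q = n - 1$ and $m = 1$, so the cardinality equals $2^n (n - 1) = (k/2)(\log k - 2)$, and the ratio to $k$ grows like $(\log k)/2$.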






5 Concluding Remarks 

We have shown that the VC dimension of every reasonably sized one-hidden-layer
network of RBF, DOG, and binary CSRF neurons is superlinear. It is not
difficult to deduce that the bound for binary CSRF networks is asymptotically
tight. For RBF and DOG networks, however, the currently available methods
give rise only to the upper bound $O(W^2 k^2)$. To narrow the gap between upper
and lower bounds for these networks is an interesting open problem.

It is also easy to obtain a linear upper bound for the single neuron in the 
RBF and binary CSRF case, whereas for the DOG neuron the upper bound is 
quadratic. We conjecture that also the DOG neuron has a linear VC dimension, 
but the methods currently available do not seem to permit an answer. 

The bounds we have derived involve constant factors that are the largest 
known for any standard neural network with one hidden layer. This fact could be 
evidence of the higher cooperative computational capabilities of local receptive 
field neurons in comparison to other neuron types. This statement, however, 
must be taken with care since the constants involved in the bounds are not yet 
known to be tight. 

RBF neural networks compute a particular type of kernel-based functions. 
The method we have developed for obtaining the results presented here is of quite 
general nature. We expect it therefore to be applicable for other kernel-based 
function classes as well. 

References 

1. Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical 
Foundations. Cambridge University Press, Cambridge, 1999. 

2. Martin Anthony and Sean B. Holden. Quantifying generalization in linearly 
weighted neural networks. Complex Systems, 8:91-114, 1994. 

3. Peter L. Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC dimension 
bounds for piecewise polynomial networks. Neural Computation, 10:2159-2173, 
1998. 

4. Christopher M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 
Oxford, 1995. 

5. D. S. Broomhead and David Lowe. Multivariable functional interpolation and 
adaptive networks. Complex Systems, 2:321-355, 1988. 

6. Thomas M. Cover. Geometrical and statistical properties of systems of linear
inequalities with applications in pattern recognition. IEEE Transactions on
Electronic Computers, 14:326-334, 1965.

7. Yossi Erlich, Dan Chazan, Scott Petrack, and Avraham Levy. Lower bound on
VC-dimension by local shattering. Neural Computation, 9:771-776, 1997.

8. Vadim D. Glezer. Vision and Mind: Modeling Mental Functions. Lawrence
Erlbaum, Mahwah, New Jersey, 1995.

9. Eric J. Hartman, James D. Keeler, and Jacek M. Kowalski. Layered neural
networks with Gaussian hidden units as universal approximations. Neural
Computation, 2:210-215, 1990.

10. Pascal Koiran and Eduardo D. Sontag. Neural networks with quadratic VC
dimension. Journal of Computer and System Sciences, 54:190-198, 1997.







11. Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Lower bounds on the 
VC dimension of smoothly parameterized function classes. Neural Computation, 
7:1040-1053, 1995. 

12. Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to “Lower 
bounds on VC-dimension of smoothly parameterized function classes”. Neural 
Computation, 9:765-769, 1997. 

13. Wolfgang Maass. Neural nets with super-linear VC-dimension. Neural
Computation, 6:877-884, 1994.

14. David Marr. Vision: A Computational Investigation into the Human
Representation and Processing of Visual Information. Freeman, New York, 1982.

15. H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic 
functions. Neural Computation, 8:164-177, 1996. 

16. John Moody and Christian J. Darken. Fast learning in networks of locally-tuned 
processing units. Neural Computation, 1:281-294, 1989. 

17. John G. Nicholls, A. Robert Martin, and Bruce G. Wallace. From Neuron to Brain: 
A Cellular and Molecular Approach to the Function of the Nervous System. Sinauer 
Associates, Sunderland, Mass., third edition, 1992. 

18. Jooyoung Park and Irwin W. Sandberg. Approximation and radial-basis-function 
networks. Neural Computation, 5:305-316, 1993. 

19. Tomaso Poggio and Federico Girosi. Networks for approximation and learning. 
Proceedings of the IEEE, 78:1481-1497, 1990. 

20. M. J. D. Powell. The theory of radial basis function approximation in 1990. In
Will Light, editor, Advances in Numerical Analysis II: Wavelets, Subdivision
Algorithms, and Radial Basis Functions, chapter 3, pages 105-210. Clarendon Press,
Oxford, 1992.

21. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University 
Press, Cambridge, 1996. 

22. Akito Sakurai. Tighter bounds of the VC-dimension of three layer networks. In 
Proceedings of the World Congress on Neural Networks, volume 3, pages 540-543. 
Erlbaum, Hillsdale, New Jersey, 1993. 

23. Marc Tessier-Lavigne. Phototransduction and information processing in the retina. 
In Eric R. Kandel, James H. Schwartz, and Thomas M. Jessell, editors, Principles 
of Neural Science, chapter 28, pages 400-418. Prentice Hall, Englewood Cliffs, New 
Jersey, third edition, 1991. 




Tracking a Small Set of Experts 
by Mixing Past Posteriors* 



Olivier Bousquet 1 and Manfred K. Warmuth 2 

1 Centre de Mathematiques Appliquees 
Ecole Polytechnique 
91128 Palaiseau 
France 

bousquet@cmapx.polytechnique.fr
2 Computer Science Department
University of California, Santa Cruz
Santa Cruz, CA 95064
U.S.A.

manfred@cse.ucsc.edu



Abstract. In this paper, we examine on-line learning problems in which
the target concept is allowed to change over time. In each trial a master
algorithm receives predictions from a large set of $n$ experts. Its goal is to
predict almost as well as the best sequence of such experts chosen off-line
by partitioning the training sequence into $k+1$ sections and then choosing
the best expert for each section. We build on methods developed by
Herbster and Warmuth and consider an open problem posed by Freund
where the experts in the best partition are from a small pool of size $m$.
Since $k \gg m$ the best expert shifts back and forth between the experts
of the small pool. We propose algorithms that solve this open problem
by mixing the past posteriors maintained by the master algorithm. We
relate the number of bits needed for encoding the best partition to the
loss bounds of the algorithms. Instead of paying $\log n$ for choosing the
best expert in each section we first pay $\log\binom{n}{m}$ bits in the bounds for
identifying the pool of $m$ experts and then $\log m$ bits per new section. In
the bounds we also pay twice for encoding the boundaries of the sections.



1 Introduction 

We consider the following standard on-line learning model in which a master 
algorithm has to combine the predictions from a set of experts piraanj . 
Learning proceeds in trials. In each trial the master receives the predictions 
from n experts and uses them to form its own prediction. At the end of the 
trial both the master and the experts receive the true outcome and incur a 
loss measuring the discrepancy between their predictions and the outcome. The 
master maintains a weight for each of its experts. The weight of an expert is 

* Supported by NSF grant CCR 9821087. This research was done while the first author 
was visiting UC Santa Cruz 



D. Helmbold and B. Williamson (Eds.): COLT/EuroCOLT 2001, LNAI 2111, pp. 31-^3 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 






an estimate of the “quality” of this expert’s predictions and the master forms
its prediction based on a weighted combination of the experts’ predictions. The
master updates the experts’ weights at the end of each trial based on the losses
of the experts and master.



The goal is to design weight updates that guarantee that the loss of the 
master is never much larger than the loss of the best expert or the best convex 
combination of the losses of the experts. So here the best expert or convex 
combination serves as a comparator. 



A more challenging goal is to learn well when the comparator changes over 
time. So now the sequence of trials is partitioned into sections. In each section 
the loss of the algorithm is compared to the loss of a particular expert and this 
expert changes at the beginning of a new section. The goal of the master now 
is to do almost as well as the best partition. Bounds of this type were first 
investigated by Littlestone and Warmuth m and then studied in more detail 
by Herbster and Warmuth Q and Vovk DU Other work on learning in relation 
to a shifting comparator but not in the expert setting appears in |‘2I1 Oil 4J| . 



In this paper we want to model situations where the comparators are from
a small pool of $m$ convex combinations of the $n$ experts, each represented by a
probability vector $u_j$ ($1 \le j \le m$). In the initial segment a convex combination
$u_1$ might be the best comparator. Then at some point there is a shift and $u_2$
does well. In a third section, $u_1$ might again be best and so forth. The pool size
is small ($m \ll n$) and the best comparator switches back and forth between
the few convex combinations in the pool ($m \ll k$, where $k$ is the number of
shifts). Of course, the convex combinations of the pool are not known to the
master algorithm.



This type of setting was popularized by an open problem posed by Yoav
Freund [5]. In his version of the problem he focused on the special case where
the pool consists of single experts (i.e. the convex combinations in the pool
are unit vectors). Thus the goal is to develop bounds for the case when the
comparator shifts back and forth within a pool of $m$ out of a much larger set of $n$
experts.

In [SJ bounds were developed where the additional loss of the algorithm over 
the loss of the best comparator partition is proportional to the number of bits 
needed to encode the partition. Following this approach Freund suggests the 
following additional loss bound for his open problem: $\log\binom{n}{m} \approx m \log\frac{n}{m}$ bits for
choosing the pool of $m$ experts, $\log m$ bits per segment for choosing an expert
from the pool, and $\log\binom{T-1}{k} \approx k \log\frac{T}{k}$ bits for specifying the $k$ boundaries of
the segments (where T is the total number of trials). 
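
The saving suggested by this encoding can be illustrated with a short computation. The sketch below assumes base-2 logarithms and $k+1$ segments for $k$ boundaries; the function names and the example values are hypothetical. It compares the number of bits for a partition drawn from a pool of $m$ experts with the number of bits when each segment may use any of the $n$ experts.

```python
import math

def pooled_bits(n, m, k, T):
    """Bits to name the pool, one pool member per segment, and the k boundaries."""
    pool = math.log2(math.comb(n, m))
    per_segment = (k + 1) * math.log2(m)
    boundaries = math.log2(math.comb(T - 1, k))
    return pool + per_segment + boundaries

def unpooled_bits(n, k, T):
    """Bits when every segment chooses freely among all n experts."""
    return (k + 1) * math.log2(n) + math.log2(math.comb(T - 1, k))

n, m, k, T = 1024, 2, 20, 10000
print(pooled_bits(n, m, k, T), unpooled_bits(n, k, T))
```

With $n = 1024$, $m = 2$ and $k = 20$, choosing experts costs about $19 + 21 = 40$ bits in the pooled encoding against $21 \cdot 10 = 210$ bits otherwise; the cost of the boundaries is the same in both cases.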

In this paper we solve Freund’s open problem. Our methods build on those 
developed by Herbster and Warmuth 0. There are two types of updates: a Loss 
Update followed by a Mixing Update. The Loss Update is the standard update 
used for the expert setting PEEna in which the weights of the experts decay 
exponentially with the loss. In the case of the log loss this becomes Bayes rule 
for computing the posterior weights for the experts. In the new Mixing Update 







the weight vector in the next trial becomes a mixture of all the past posteriors 
where the current posterior always has the largest mixture coefficient. 

The key insight of our paper is to design the mixture coefficients for the
past posteriors. In our main scheme the coefficient for the current posterior is
$1 - \alpha$ for some small $\alpha \in [0,1]$ and the coefficient for the posterior $d$ trials in the
past is proportional to $\alpha/d$. Curiously enough this scheme solves Freund’s open
problem: when the comparators are single experts, the additional loss of our
algorithms over the loss of the best comparator partition is of the order of the number
of bits needed to encode the partition. For this scheme all past posteriors need to
be stored, requiring time and space $O(nt)$ at trial $t$. However, we show how this
mixing scheme can be approximated in time and space $O(n \ln t)$. The simplest
scheme has slightly weaker bounds: the coefficients of all past posteriors (there
are $t$ of them at trial $t$) are $\alpha/t$. Now only the average of the past posteriors
needs to be maintained, requiring time and space $O(n)$.
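
The two updates and the two coefficient schemes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm as specified later in the paper: the learning rate `eta`, the treatment of the initial weight vector as the oldest "posterior", the normalisation of the $\alpha/d$ coefficients, and all function names are hypothetical.

```python
import math

def mixing_coefficients(t, alpha, scheme="decaying"):
    """Return coef[0..t]; coef[d] weights the posterior d trials in the past."""
    if t == 0:
        return [1.0]                      # nothing to mix with yet
    if scheme == "decaying":
        raw = [1.0 / d for d in range(1, t + 1)]   # proportional to 1/d
    elif scheme == "uniform":
        raw = [1.0] * t                            # alpha/t each
    else:
        raise ValueError("unknown scheme")
    z = sum(raw)
    return [1.0 - alpha] + [alpha * r / z for r in raw]

def loss_update(v, losses, eta=1.0):
    """Weights decay exponentially with the loss, then are renormalised."""
    w = [vi * math.exp(-eta * li) for vi, li in zip(v, losses)]
    z = sum(w)
    return [wi / z for wi in w]

def mixing_update(posteriors, alpha, scheme="decaying"):
    """Mix the current posterior (last entry) with all earlier ones."""
    t = len(posteriors) - 1
    coef = mixing_coefficients(t, alpha, scheme)
    n = len(posteriors[-1])
    return [sum(coef[d] * posteriors[-1 - d][i] for d in range(t + 1))
            for i in range(n)]

# Three experts, three trials with made-up losses.
history = [[1 / 3, 1 / 3, 1 / 3]]         # initial weights, assumed to count as past
v = history[0]
for losses in ([0.0, 1.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]):
    history.append(loss_update(v, losses))
    v = mixing_update(history, alpha=0.1)
print(v)
```

Storing the whole list `history` corresponds to the $O(nt)$ cost mentioned above; for the uniform scheme a running average of the past posteriors would suffice.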

We begin by reviewing some preliminaries about the expert setting and then 
give our main algorithm in Section 0 This algorithm contains the main schemes 
for choosing the mixture coefficients. In Section0we prove bounds for the various 
mixing schemes. In particular, we discuss the optimality of the bounds in relation 
to the number of bits needed to encode the best partition. We then discuss 
alternates to our main algorithm in Section 0 and experimentally compare the 
algorithms in Section 0 We conclude with a number of open problems. 



2 Preliminaries 

Let $T$ denote the number of trials and $n$ the number of experts. We will refer
to the experts by their index $i \in \{1, \ldots, n\}$. At trial $t = 1, \ldots, T$, the master
receives a vector $x_t$ of $n$ predictions, where $x_{t,i}$ is the prediction of the $i$-th expert.
The master then must produce a prediction $\hat y_t$ and, following that, receives the
true outcome $y_t$ for trial $t$. We assume that $x_{t,i}, \hat y_t, y_t \in [0, 1]$.

A loss function $L : [0,1] \times [0,1] \to [0,\infty]$ is used to measure the discrepancy
between the true outcome and the predictions. Expert $i$ incurs loss
$L_{t,i} = L(y_t, x_{t,i})$ at trial $t$ and the master algorithm $A$ incurs loss $L_{t,A} = L(y_t, \hat y_t)$.
For the cumulative loss over a sequence, we will use the shorthand notation
$L_{1..T,A} = \sum_{t=1}^{T} L_{t,A}$.

The weight vector maintained by the algorithm at trial $t$ is denoted by $v_t$.
Its elements are non-negative and sum to one, i.e. $v_t$ is in the $n$-dimensional
probability simplex denoted by $\mathcal{P}_n$.

Based on the current weight vector $v_t$, and the experts' predictions $x_t$, the
master algorithm uses a prediction function $\mathrm{pred} : \mathcal{P}_n \times [0, 1]^n \to [0, 1]$ to
compute its prediction for trial $t$: $\hat y_t = \mathrm{pred}(v_t, x_t)$. In the simplest case the
prediction function is an average of the experts' predictions $\mathrm{pred}(v, x) = v \cdot x$ (See
PI). Refined prediction functions can be defined depending on the loss function 
used m- The loss and prediction functions are characterized by the following 
definition. 



